Abstract

The topic of recovery of a structured model given a small number of linear observations has been
well-studied in recent years. Examples include recovering sparse or group-sparse vectors, low-rank matrices,
and the sum of sparse and low-rank matrices, among others. In various applications in signal processing 
and machine learning, the model of interest is known to be structured in several ways at the same time, 
for example, a matrix that is simultaneously sparse and low-rank. An important application is the sparse 
phase retrieval problem, where the goal is to recover a sparse signal from phaseless measurements. In 
machine learning, the problem comes up when combining several regularizers that each promote a certain 
desired structure. 

Often penalties (norms) that promote each individual structure are known and yield an order-wise 
optimal number of measurements (e.g., $\ell_1$ norm for sparsity, nuclear norm for matrix rank), so it is
reasonable to minimize a combination of such norms. We show that, surprisingly, if we use
multi-objective optimization with the individual norms, then we can do no better, order-wise, than an algorithm
that exploits only one of the several structures. This result suggests that to fully exploit the multiple 
structures, we need an entirely new convex relaxation, i.e., not one that is a function of the convex 
relaxations used for each structure. We then specialize our results to the case of sparse and low-rank 
matrices. We show that a nonconvex formulation of the problem can recover the model from very 
few measurements, on the order of the degrees of freedom of the matrix, whereas the convex problem 
obtained from a combination of the $\ell_1$ and nuclear norms requires many more measurements. This proves
an order-wise gap between the performance of the convex and nonconvex recovery problems in this case. 

Keywords: compressed sensing, convex relaxation, regularization, performance bounds

1 Introduction 



Recovery of a structured model (signal) given a small number of linear observations has been the focus
of many studies recently. Examples include recovering sparse or group-sparse vectors (which gave rise to
the area of compressed sensing) [1, 2, 3], low-rank matrices [4, 5], and the sum of sparse and low-rank
matrices [6, 7], among others. More generally, the recovery of a signal that can be expressed as the sum of
a few atoms out of an appropriate atomic set has been studied in [8]. Canonical questions that have guided
research in this area include: How many generic linear measurements are enough to recover the model by any 
means? How many measurements are enough for a tractable approach, e.g., solving a convex optimization 
problem? In the statistics literature, these questions are posed in terms of "sample complexity" and error 
rates for estimators minimizing the sum of a quadratic loss function and a regularizer that reflects the desired 
structure [9]. 

In practice, there are many cases where the model of interest is known to be structured in several ways
at the same time. We then seek a signal that lies in the intersection of several sets defining the individual 
structures (in a sense that we will make precise later). Such recovery problems arise often in applications, 
for example in signal processing (Section 1.2) as well as statistical learning. The most common convex 
regularizer (norm penalty), used to promote all structures together, is a linear combination of well-known 
regularizers for each structure. However, there is currently no general analysis and understanding of how 
well such regularization performs, in terms of the number of observations required for successful recovery 
of the desired model. This paper addresses this ubiquitous yet unexplored problem; i.e., the recovery of 
simultaneously structured models. 

An example of a simultaneously structured model is a matrix that is simultaneously sparse and low-rank. 
In this case, one would like to come up with algorithms that exploit both types of structures to minimize the 
number of measurements required for recovery. An $n \times n$ matrix with rank $r \ll n$ can be described by $O(rn)$
parameters, and can be recovered using $O(rn)$ generic measurements via nuclear norm minimization [4, 10].
On the other hand, a block-sparse matrix with a $k \times k$ nonzero block where $k \ll n$ can be described by $k^2$
parameters and can be recovered with $O(k^2 \log(n/k))$ generic measurements using $\ell_1$ minimization. However,
a matrix that is both rank $r$ and block-sparse can be described by $O(rk)$ parameters. The question is whether
we can exploit this joint structure to efficiently recover such a matrix with $O(rk)$ measurements.

In this paper we answer this question in the negative, in the following sense: if we use multi-objective
optimization with the $\ell_1$ norm and the nuclear norm (used for sparse signals and low rank matrices,
respectively), then the number of measurements required is lower bounded by a quantity on the order of
$\min\{k^2, rn\}$. In other words, we need at least this number of observations for the desired signal to lie
on the Pareto optimal front traced by the objectives, the $\ell_1$ norm and the nuclear norm. This means we can
do no better than an algorithm that exploits only one of the two structures.
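
To give a feel for the size of this gap, the following worked instance plugs in illustrative values. The choices $n = 1000$, $k = 30$, $r = 2$ are assumptions made only for this example (they do not come from the paper), and constants and logarithmic factors are ignored throughout.

```latex
\begin{align*}
\text{degrees of freedom:}\quad & rk = 2 \cdot 30 = 60,\\
\text{$\ell_1$ norm alone:}\quad & k^2 = 30^2 = 900,\\
\text{nuclear norm alone:}\quad & rn = 2 \cdot 1000 = 2000,\\
\text{combined norms (lower bound):}\quad & \min\{k^2, rn\} = \min\{900, 2000\} = 900.
\end{align*}
```

In this instance the combined penalty still needs on the order of 900 measurements, 15 times the degrees-of-freedom count of 60.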

We introduce a framework to express general simultaneous structures, and as our main result, we prove 
that the same phenomenon happens for a general set of structures under reasonable assumptions on the norm 
penalties used. These assumptions hold in many typical cases of interest, such as combinations of sparse, 
group-sparse, and low-rank structures. The measurements we consider are generic measurements; we focus 
on random Gaussian measurement matrices, with independent and identically distributed entries. This gives 
an open, dense subset of the set of all $m$-measurement matrices, hence justifying the term "generic".

Table 1 summarizes known results on recovery of some common structured models, along with a result 
of this paper specialized to the problem of low-rank and sparse matrix recovery. The first column gives 
the number of parameters needed to describe the model (often referred to as its 'degrees of freedom'), the 
second and third columns show how many generic measurements are needed for successful recovery. In 
using 'nonconvex recovery', we assume we are able to find the global minimum of a nonconvex problem. 
This is clearly intractable in general, and not a practical recovery method — we consider it as a benchmark 
for theoretical comparison with the (tractable) convex relaxation, in order to determine how powerful the 
relaxation is. 

The first and second rows are the results on $k$-sparse vectors in $\mathbb{R}^n$ and rank $r$ matrices in
$\mathbb{R}^{n \times n}$, respectively [11, 10]. The third row considers the recovery of "low-rank plus sparse"
matrices. Consider a matrix $X \in \mathbb{R}^{n \times n}$ that can be decomposed as $X = X^L + X^S$, where $X^L$ is a
rank $r$ matrix and $X^S$ is a matrix with only $k$ nonzero entries. The degrees of freedom of $X$ is $O(rn + k)$.
Minimizing the infimal convolution of the $\ell_1$ norm and the nuclear norm, i.e.,
$f(X) = \min_Y \|Y\|_\star + \lambda \|X - Y\|_1$, subject to random Gaussian measurements on $X$, gives a convex
approach for recovering $X$. It has been shown that under reasonable incoherence assumptions, $X$ can be
recovered uniquely from $O((rn + k) \log^2 n)$ measurements, which is suboptimal only by a logarithmic
factor [12]. Finally, the last row in Table 1 shows one of the results in this paper. Let
$X \in \mathbb{R}^{n \times n}$ be a rank $r$ matrix whose entries are zero outside a $k_1 \times k_2$ submatrix. The degrees
of freedom of $X$ is $O((k_1 + k_2) r)$. We consider both convex and nonconvex programs for the recovery of this
type of matrix. The nonconvex method involves minimizing the number of nonzero rows, the number of nonzero
columns, and the rank of the matrix jointly, as discussed in Section 3.2. As will be shown later,
$O((k_1 + k_2) r \log n)$ measurements suffice for this program to successfully recover the original matrix. The
convex method minimizes any convex combination of the individual structure-inducing norms, namely the
nuclear norm and the $\ell_{1,2}$ norms of the rows and columns [13, 14]. We show that with high probability this
program cannot recover the original matrix with fewer than $\Omega(rn)$ measurements. In summary, while the
nonconvex method is slightly suboptimal, the convex method performs poorly, as the number of measurements
scales with $n$ rather than $k_1 + k_2$.

| Model | Degrees of freedom | Nonconvex recovery | Convex recovery |
|---|---|---|---|
| Sparse vectors | $k$ | $O(k)$ | $O(k \log\frac{n}{k})$ |
| Low rank matrices | $r(2n - r)$ | $O(rn)$ | $O(rn)$ |
| Low rank plus sparse | $O(rn + k)$ | not analyzed | $O((rn + k) \log^2 n)$ |
| Low rank and sparse | $O(r(k_1 + k_2))$ | $O(r(k_1 + k_2) \log n)$ | $\Omega(rn)$ |

Table 1: Summary of results in recovery of structured signals. This paper shows a gap between the performance of
convex and nonconvex recovery programs for simultaneously structured matrices (last row).

1.1 Contributions 

This paper describes a general framework for analyzing the recovery of models that have more than one 
structure, by combining penalty functions corresponding to each structure. The framework proposed
includes special cases that are of interest in their own right, e.g., sparse and low-rank matrix recovery. Our
contributions can be summarized as follows. 

Poor performance of convex relaxations. We consider a model with several structures with the
assumption that all structure-inducing norms are decomposable at the true input signal $x_0$ (see Section 2).
For recovery, we consider a multi-objective optimization problem to minimize the individual norms
simultaneously. Using Pareto optimality, we know that minimizing a weighted sum of the norms and varying the
weights traces out all points of the Pareto-optimal front (i.e., the trade-off surface, Section 2). We obtain a
lower bound on the number of measurements that holds no matter what the weights are and no matter what
function is used to trace the points on the Pareto-optimal front. A sketch of our main result is as follows.

Given a model (signal) $x_0$ with $\tau$ simultaneous structures satisfying certain conditions, the number
of generic measurements required for recovery with high probability using any linear combination
of the individual norms satisfies the lower bound

$$m \geq c\, d_{\min} = c \min_{i = 1, \dots, \tau} d_i,$$

where $d_i$ is approximately on the order of the number of measurements required if minimizing the
$i$th norm only. The constant $c$ depends on the individual norms, as well as the relative geometry
of their norm balls at $x_0$.

With $d_{\min}$ as the bottleneck, this result implies that the combination of norms can perform no better than
using only one of the norms, even though the target model is tightly constrained and has very few degrees
of freedom.

Incorporating general cone constraints. Our results incorporate side information on $x_0$, expressed as
convex cone constraints. This additional information about the signal helps in recovery; however, quantifying 
how much the cone constraints can help is not trivial. Our analysis explicitly determines the role of the cone 
constraint: Geometric properties of the cone such as its Gaussian width determine the constant factors in 
the bound on the number of measurements. 

Sparse and Low-rank matrix recovery: illustrating a gap. As a special case, we consider the 
recovery of simultaneously sparse and low-rank matrices and prove that there is a significant gap between 
the performance of convex and non-convex recovery programs. This gap is surprising when one considers 
similar results in low-dimensional model recovery discussed above in Table 1. 

1.2 Some applications

Here we survey several applications of the sparse and low-rank matrix recovery problem, as well as existing
results specific to these applications. A related problem is the Sparse Principal Component Analysis
problem, mentioned in Section 7. Other applications include Collaborative Hierarchical Sparse Modeling [15],
where sparsity is considered within the nonzero blocks in a block-sparse vector, and the recovery of
hyperspectral images, where we aim to recover a simultaneously block-sparse and low-rank matrix from
compressed observations [16].

Sparse signal recovery from quadratic measurements. Sparsity has long been exploited in signal
processing, applied mathematics, statistics and computer science for tasks such as compression, denoising,
model selection, image processing and more. Despite the great interest in exploiting sparsity in various
applications, most of the work to date has focused on recovering sparse or low rank data from linear
measurements. Recently, the basic sparse recovery problem has been generalized to the case in which the
measurements are given by general nonlinear transforms of the unknown input [18]. A special case of this more
general setting is quadratic compressed sensing, introduced in [17], in which the goal is to recover a sparse
vector $x$ from quadratic measurements $b_i = x^T A_i x$. This problem can be linearized by lifting, where we wish
to recover a "low rank and sparse" matrix $X = x x^T$ subject to measurements $b_i = \langle A_i, X \rangle$.

Sparse recovery problems from quadratic measurements arise in a variety of different problems in optics. 
One example is sub-wavelength optical imaging [19, 17] in which the goal is to recover a sparse image from its 
far-field measurements, where due to the laws of physics the relationship between the (clean) measurement 
and the unknown image is quadratic. In [17] the quadratic relationship is a result of using partially-incoherent 
light. The quadratic behavior of the measurements in [19] arises from coherent diffractive imaging in which
the image is recovered from its intensity pattern. Under an appropriate experimental setup, this problem 
amounts to reconstruction of a sparse signal from the magnitude of its Fourier transform. 

Sparse phase retrieval. Quadratic measurements appear in phase retrieval problems, in which a signal
is to be recovered from the magnitude of its measurements $b_i = |a_i^T x|$, where each measurement is a linear
transform of the input $x \in \mathbb{R}^n$ and the $a_i$ are arbitrary, possibly complex-valued measurement vectors.
Phase retrieval is of great interest in many applications such as optical imaging [20, 21], crystallography [22],
and more [23].

The problem becomes linear when $x$ is lifted and we consider the recovery of $X = x x^T$, where each
measurement takes the form $b_i^2 = \langle a_i a_i^T, X \rangle$. In [17], an algorithm was developed to treat phase retrieval
problems with sparse $x$ based on a semidefinite relaxation, and low-rank matrix recovery combined with a
row-sparsity constraint on the resulting matrix. More recent works also proposed the use of semidefinite
relaxation together with sparsity constraints for phase retrieval [24, 25, 26, 27]. An alternative algorithm was
recently designed in [28] based on a greedy search. In [26], the authors also consider sparse signal recovery
based on combinatorial and probabilistic approaches and give uniqueness results under certain conditions.
Stable uniqueness in phase retrieval problems is studied in [29]. The results of [30, 31, 32, 33] apply to
general (non-sparse) signals, where in some cases masked versions of the signal are required.
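
The lifting step used in this subsection can be checked numerically. The sketch below is only an illustration under assumed inputs: the helper names (`outer`, `frobenius_inner`, `quad_form`), the signal values, and the random matrix are all hypothetical and real-valued. It verifies that a quadratic measurement of $x$ equals a linear measurement of the lifted matrix, both for a general matrix and for the rank-one case that arises in phase retrieval.

```python
import random

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def frobenius_inner(A, B):
    """Matrix inner product <A, B> = sum_ij A_ij * B_ij."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def quad_form(x, A):
    """Quadratic measurement x^T A x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

random.seed(0)
n = 6
x = [0.0, 1.5, 0.0, -2.0, 0.0, 0.0]   # 2-sparse signal (illustrative values)
X = outer(x, x)                        # lifted variable X = x x^T (rank one)

A = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
a = [random.gauss(0.0, 1.0) for _ in range(n)]

quadratic = quad_form(x, A)            # quadratic in x
lifted = frobenius_inner(A, X)         # linear in X

phase_sq = sum(ai * xi for ai, xi in zip(a, x)) ** 2   # |a^T x|^2
lifted_phase = frobenius_inner(outer(a, a), X)         # <a a^T, X>

# The lifted matrix inherits sparsity: its nonzeros sit in a 2 x 2 block.
nonzeros = sum(1 for row in X for v in row if v != 0.0)
```

The lifted unknown is simultaneously rank one and supported on a small block, which is the simultaneously structured setting studied in this paper.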

The problem of recovering a signal from the magnitude of its Fourier transform has been studied exten- 
sively in the literature. Many methods have been developed for phase recovery [23] which often rely on prior 
information about the signal, such as positivity or support constraints. One of the most popular techniques 
is based on alternating projections, where the current signal estimate is transformed back and forth between 
the object and the Fourier domains. The prior information and observations are used in each domain in order 
to form the next estimate. Two of the main approaches of this type are Gerchberg-Saxton [35] and Fienup 
[34]. In general, these methods are not guaranteed to converge, and often require careful parameter selection 
and sufficient signal constraints in order to provide a reasonable result. Approaches based on semidefinite 
relaxation or the recently proposed greedy methods appear to lead to far superior recovery and convergence 
results. 

1.3 Outline of the paper

The paper is structured as follows. Background, definitions, and assumptions are given in Section 2. An 
overview of the main results is provided in Section 3. The proofs of the general results are presented in 
Section 4. The proofs for the special case of simultaneously sparse and low-rank matrices are discussed 
in Section 5, where we compare corollaries of the general results with the results on non-convex recovery 
approaches and illustrate a gap. Numerical simulations in Section 6 empirically support the results on sparse 
and low-rank matrices. Future directions of research and discussion of results are in Section 7. 

2 Problem Setup 

We begin by reviewing some basic definitions. For a vector $x \in \mathbb{R}^n$, $\|x\|$ denotes a general norm and
$\|x\|^* = \sup_{\|z\| \leq 1} \langle x, z \rangle$ is the corresponding dual norm. A subgradient of the norm $\|\cdot\|$ at $x$ is a
vector $g$ for which $\|z\| \geq \|x\| + \langle g, z - x \rangle$ holds for any $z$. The set of all subgradients is called the
subdifferential, denoted by $\partial\|x\|$. We consider finite dimensional spaces and convex functions, so the
subdifferential $\partial\|x\|$ is always a compact convex set. For a convex set $M$ and point $x$, we define the
projection operator as

$$\mathcal{P}_M(x) = \arg\min_{u \in M} \|x - u\|_2.$$

For a subspace $M$, denote its orthogonal complement by $M^\perp$. The set of $n \times n$ positive semidefinite (PSD)
and symmetric matrices will be denoted by $\mathbb{S}^n_+$ and $\mathbb{S}^n$ respectively. Given a cone $\mathcal{C}$, denote its dual
by $\mathcal{C}^*$, defined as

$$\mathcal{C}^* = \{ z \mid \langle z, v \rangle \geq 0 \text{ for all } v \in \mathcal{C} \}. \tag{2.1}$$

The polar cone is denoted by $\mathcal{C}^\circ$ and is given by $\mathcal{C}^\circ = -\mathcal{C}^*$.

When considering simultaneously structured signals, we restrict our attention to structures associated
with norms that satisfy a decomposability property, defined next. The definition is sufficiently general to
cover popular structure-inducing norms. Examples include the $\ell_1$ norm, which induces vector sparsity, the
$\ell_{1,2}$ norm, which induces column sparsity in a matrix, and the nuclear norm, which promotes a low-rank
matrix. The nuclear norm gives the sum of the singular values of a matrix and will be denoted by
$\|\cdot\|_\star$.

Definition 2.1 (Decomposable Norm) A norm $\|\cdot\|$ is decomposable at $x \in \mathbb{R}^n$ if there exist a subspace
$T \subseteq \mathbb{R}^n$ and a vector $e \in T$ such that the subdifferential at $x$ has the form

$$\partial\|x\| = \{ z \in \mathbb{R}^n : \mathcal{P}_T(z) = e,\ \|\mathcal{P}_{T^\perp}(z)\|^* \leq 1 \}, \tag{2.2}$$

and for all $s \in T^\perp$ we have

$$\|s\| = \sup_{z \in T^\perp,\ \|z\|^* \leq 1} \langle s, z \rangle. \tag{2.3}$$

We refer to $T$ as the support and $e$ as the sign vector of $x$ with respect to $\|\cdot\|$.

Definition 2.1 is used in [36], and is closely related to the one given in [12], where the authors assume
non-expansiveness of $\mathcal{P}_{T^\perp}$ with respect to the dual norm $\|\cdot\|^*$ instead of (2.3); that is,
$\|\mathcal{P}_{T^\perp}(x)\|^* \leq \|x\|^*$ holds for all $x \in \mathbb{R}^n$ (this condition implies (2.3)).

The following lemma gives a useful relation between x and the corresponding sign and support. 

Lemma 2.1 Let $\|\cdot\|$, $x$, $e$ and $T$ be as in Definition 2.1. We have that $x \in T$ and $\langle x, e \rangle \geq 0$.

Proof. Using the definition of subgradient of a norm, it can be shown that for any $g \in \partial\|x\|$ we have
$\langle x, g \rangle = \|x\|$ [37]. From (2.2), we know that $e \in \partial\|x\|$, hence $\langle x, e \rangle = \|x\| \geq 0$. Now, using (2.3), choose
$z \in T^\perp$ such that $\langle x, z \rangle = \|\mathcal{P}_{T^\perp}(x)\|$ and $\|z\|^* \leq 1$. Then $e + z \in \partial\|x\|$ and

$$\|x\| = \langle x, e + z \rangle = \|x\| + \|\mathcal{P}_{T^\perp}(x)\|, \tag{2.4}$$

implying $\|\mathcal{P}_{T^\perp}(x)\| = 0$, or $\mathcal{P}_{T^\perp}(x) = 0$, which means $x \in T$. ■

To give some intuition for Definition 2.1, we review examples of norms that arise when considering
simultaneously sparse and low rank matrices. For a matrix $X \in \mathbb{R}^{n_1 \times n_2}$, let $X_{i,j}$, $X_{i,\cdot}$ and
$X_{\cdot,j}$ denote its $(i, j)$ entry, $i$th row and $j$th column respectively.

Lemma 2.2 (see [36]) The $\ell_1$, $\ell_{1,2}$ and $\|\cdot\|_\star$ norms are decomposable as follows.

• The $\ell_1$ norm is decomposable at every $x \in \mathbb{R}^n$, with sign $e = \operatorname{sgn}(x)$ and support

$$T = \operatorname{supp}(x) = \{ y \in \mathbb{R}^n : x_i = 0 \Rightarrow y_i = 0 \text{ for } i = 1, \dots, n \}.$$

• The $\ell_{1,2}$ norm is decomposable at every $X \in \mathbb{R}^{n_1 \times n_2}$. The support is

$$T = \{ Y \in \mathbb{R}^{n_1 \times n_2} : X_{\cdot,i} = 0 \Rightarrow Y_{\cdot,i} = 0 \text{ for } i = 1, \dots, n_2 \},$$

and the sign vector $e \in \mathbb{R}^{n_1 \times n_2}$ is obtained by normalizing the columns of $X$ present in the support,
$e_{\cdot,j} = X_{\cdot,j} / \|X_{\cdot,j}\|_2$ if $\|X_{\cdot,j}\|_2 \neq 0$, and setting the rest of the columns to zero.

• The nuclear norm is decomposable at every $X \in \mathbb{R}^{n_1 \times n_2}$. For a matrix $X$ with rank $r$ and compact singular
value decomposition $X = U \Sigma V^T$ where $\Sigma \in \mathbb{R}^{r \times r}$, we have $e = U V^T$ and

$$T = \{ Y \in \mathbb{R}^{n_1 \times n_2} : (I - U U^T) Y (I - V V^T) = 0 \}
    = \{ Z_1 V^T + U Z_2^T \mid Z_1 \in \mathbb{R}^{n_1 \times r},\ Z_2 \in \mathbb{R}^{n_2 \times r} \}.$$

The reader is referred to [36] for a more detailed discussion. Combining these examples with Table 1, it can
be observed that the degrees of freedom of a sparse signal, a column-sparse matrix and a low rank matrix are
the same as the dimensions of the supports for the $\ell_1$ norm, the $\ell_{1,2}$ norm and the nuclear norm respectively.
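
For the $\ell_1$ case, the structure described in Definition 2.1 and Lemma 2.1 can be verified directly on a small vector. The sketch below is illustrative only: the vector `x`, the off-support part `w`, and the test points are arbitrary choices, and the helper names are hypothetical. It builds a subgradient of the form $g = e + w$, with $w$ supported off the support of $x$ and entries of magnitude at most one, and checks the subgradient inequality together with $\langle x, e \rangle = \|x\|_1$.

```python
def sign(v):
    """Sign of a real number: -1, 0 or 1."""
    return (v > 0) - (v < 0)

def l1(v):
    return sum(abs(t) for t in v)

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [3.0, 0.0, -1.0, 0.0]
e = [float(sign(v)) for v in x]                        # sign vector e = sgn(x)
support = [i for i, v in enumerate(x) if v != 0.0]     # coordinates spanning T

# Off-support component: zero on the support, entries in [-1, 1] elsewhere.
w = [0.0, 0.7, 0.0, -1.0]
g = [ei + wi for ei, wi in zip(e, w)]                  # candidate subgradient

# Subgradient inequality ||z||_1 >= ||x||_1 + <g, z - x> at a few test points.
test_points = [[0, 0, 0, 0], [1, -2, 3, 4], [3, 5, -1, -5], [-3, 0, 1, 0]]
gaps = [
    l1(z) - (l1(x) + inner(g, [zi - xi for zi, xi in zip(z, x)]))
    for z in test_points
]
```

Here `len(support)` is the dimension of $T$, which equals the sparsity of `x`.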




Figure 1: An example of a decomposable norm: the $\ell_1$ norm is decomposable at $x_0 = (1, 0)$. The sign vector $e$, the
support $T$, and the shifted subspace $T^\perp$ are illustrated. A subgradient $g$ at $x_0$ and its projection onto $T^\perp$ are also shown.



Definition 2.2 Given a norm $\|\cdot\|$ decomposable at $x \in \mathbb{R}^n$, define the constant $\kappa$ as

$$\kappa = \frac{n\, \|e\|_2^2}{L^2 \dim(T)},$$

where $e$ and $T$ are the sign and support of $x$ with respect to $\|\cdot\|$, and $L$ is the Lipschitz constant of the norm,
namely,

$$L = \sup_{z_1 \neq z_2 \in \mathbb{R}^n} \frac{\|z_1\| - \|z_2\|}{\|z_1 - z_2\|_2}. \tag{2.5}$$

The notation $\|\cdot\|_2$ denotes the Euclidean norm, i.e., the $\ell_2$ norm for vectors and the Frobenius norm $\|\cdot\|_F$
for matrices. Note that $L$ is a global property of the norm while $e$, $T$ and $\kappa$ depend on both the norm and
the point under consideration (decomposability is a local property in this sense). As discussed
in Lemma 2.2, the common norms $\ell_1$, $\ell_{1,2}$, and nuclear norm are decomposable at every point. The next
lemma is an immediate consequence of Lemma 2.2 and states that $\kappa$ is simply a bounded constant.

Lemma 2.3 The values of $\kappa$ for the $\ell_1$, $\ell_{1,2}$ and $\|\cdot\|_\star$ norms are given as follows.

• The $\ell_1$ norm has Lipschitz constant $L = \sup_{x \neq 0} \frac{\|x\|_1}{\|x\|_2} = \sqrt{n}$. For a $k$-sparse vector $x$, we have
$\|e\|_2 = \|\operatorname{sgn}(x)\|_2 = \sqrt{k}$ and $\dim(T) = k$, which yield $\kappa = 1$ for any $x$.

• The $\ell_{1,2}$ norm has Lipschitz constant $L = \sup_{X \neq 0} \frac{\|X\|_{1,2}}{\|X\|_F} = \sqrt{n_2}$. At any $X$ with $k$ nonzero
columns, we have $\|e\|_F = \sqrt{k}$ and $\dim(T) = k n_1$, which gives $\kappa = 1$.

• The nuclear norm has Lipschitz constant $L = \sup_{X \neq 0} \frac{\|X\|_\star}{\|X\|_F} = \sqrt{n_2}$, assuming $n_1 \geq n_2$. For a given
matrix $X$ with rank $r$, we have $\dim(T) = r(n_1 + n_2 - r)$ and $\|e\|_F = \sqrt{r}$. These give
$\kappa = \frac{n_1}{n_1 + n_2 - r}$, which satisfies $\frac{1}{2} \leq \kappa \leq 1$ for any $X$.
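
The three values in Lemma 2.3 can be reproduced with a few lines of arithmetic. The sketch below assumes the reading $\kappa = n \|e\|_2^2 / (L^2 \dim(T))$, with $n$ the ambient dimension ($n_1 n_2$ for matrices); this reading is inferred from the values stated in the lemma. The function name `kappa` and all sizes are hypothetical choices for illustration.

```python
import math

def kappa(sign_norm_sq, lipschitz, dim_support, ambient_dim):
    """kappa = ambient_dim * ||e||_2^2 / (L^2 * dim(T)), as assumed above."""
    return (sign_norm_sq / lipschitz ** 2) * (ambient_dim / dim_support)

# l1 norm: k-sparse vector in R^n, L = sqrt(n), ||e||^2 = k, dim(T) = k.
n, k = 100, 7
kappa_l1 = kappa(k, math.sqrt(n), k, n)

# l_{1,2} norm: n1 x n2 matrix with `cols` nonzero columns,
# L = sqrt(n2), ||e||_F^2 = cols, dim(T) = cols * n1.
n1, n2, cols = 40, 30, 5
kappa_l12 = kappa(cols, math.sqrt(n2), cols * n1, n1 * n2)

# Nuclear norm: rank-r matrix with n1 >= n2,
# L = sqrt(n2), ||e||_F^2 = r, dim(T) = r * (n1 + n2 - r).
r = 3
kappa_nuc = kappa(r, math.sqrt(n2), r * (n1 + n2 - r), n1 * n2)
```

With these sizes the nuclear norm value equals $n_1 / (n_1 + n_2 - r) = 40/67$, which lies between $1/2$ and $1$ as the lemma states.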

2.1 Simultaneously structured models 

We consider a signal $x_0$ having several low dimensional structures $S_1, S_2, \dots, S_\tau$ simultaneously (e.g.,
sparsity, group sparsity, low-rank, etc.). Suppose these structures each correspond to a norm that promotes
them (e.g., $\ell_1$, $\ell_{1,2}$, nuclear norm, etc.). We further assume these norms are decomposable (e.g., all the
mentioned norms). While these assumptions may seem restrictive, they cover many cases of interest, for
example all variations of the "sparse and low rank" matrices (see Section 3.2). We refer to such an $x_0$ as a
simultaneously structured model.

To see why the $\ell_1$ norm is associated with sparsity, observe in Lemma 2.2 that the support $T$ (for the $\ell_1$ norm)
corresponds to the coordinates where $x_0$ is nonzero, and the dimension of the support is equal to the sparsity
of the signal. Similarly, for the $\ell_{1,2}$ norm (the nuclear norm), the dimension of the support depends only on the
number of nonzero blocks (rank) of the matrix. In particular, the dimension of the support is equal to the
degrees of freedom of a signal having the corresponding structure. In the case of rank $r$ matrices, this is equal to
$r(n_1 + n_2 - r)$. Hence, all these norms are decomposable and are inherently connected to the corresponding
low dimensional structure.

We will now introduce the relevant notation for a simultaneously structured signal $x_0$. Given $\{\|\cdot\|_{(i)}\}_{i=1}^{\tau}$
and $x_0$, denote the corresponding supports by $\{T_i\}_{i=1}^{\tau}$ and the sign vectors by $\{e_i\}_{i=1}^{\tau}$ (see Definition 2.1).
Let $T_\cap = \bigcap_{i=1}^{\tau} T_i$ denote the joint support of $x_0$. Moreover, suppose $x_0$ has other properties that can be
expressed as a cone constraint $x_0 \in \mathcal{C}$; an example is the positive semidefiniteness (PSD) constraint in the
sparse phase retrieval problem mentioned in Section 1. Naturally, this additional information should help in
recovery. To characterize the effect of the cone and the decomposable structures, consider the subspace

$$\mathcal{R} \triangleq T_\cap \cap \operatorname{span}\left( \{ y \in \mathbb{R}^n \mid \langle x_0, y \rangle = 0,\ y \in \mathcal{C}^* \} \right)^\perp, \tag{2.6}$$

where $\operatorname{span}(\cdot)$ returns the linear span of the elements of the set. Observe that $x_0 \in \mathcal{R}$. When there is no
cone constraint (i.e., $\mathcal{C} = \mathbb{R}^n$), we have $\mathcal{R} = T_\cap$. Similarly, if $\operatorname{span}(\mathcal{C})$ is equal to $\mathbb{R}^n$ and $x_0$ lies in the
interior of $\mathcal{C}$, we will again have $\mathcal{R} = T_\cap$. In general, the second term on the right hand side of (2.6) plays a
critical role in our analysis when $x_0$ lies on the boundary of $\mathcal{C}$. An example of this is the "sparse and low rank
matrices" setting, which is discussed in Section 5. The following definition quantifies the angle of individual sign
vectors with this subspace.

Definition 2.3 Define $e_{\cap,i} = \mathcal{P}_{\mathcal{R}}(e_i)$ and let

$$\theta_i = \frac{\|e_{\cap,i}\|_2}{\|e_i\|_2},$$

which is the cosine of the angle between $e_i$ and the subspace $\mathcal{R}$.

Figure 2: A signal with two simultaneous structures and the associated definitions.

While $\kappa_i$ captures the properties of the $i$th structure at the given point, $\theta_i$ reflects the relation between the
different structures and the given cone. The quantities

$$\kappa := \min_{i = 1, \dots, \tau} \kappa_i, \qquad \theta := \min_{i = 1, \dots, \tau} \theta_i$$

will be used in the statement of results. Note that $\theta$ determines the maximum "spread" of the vectors
$\{e_i\}_{i=1}^{\tau}$ from $\mathcal{R}$.
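
As a small numerical illustration of the angle quantity in Definition 2.3, the sketch below computes the cosine of the angle between a vector and a subspace as the ratio of the projected norm to the full norm. It assumes the subspace is given by an orthonormal basis; the basis, the two sign vectors, and the helper names are hypothetical toy choices and are not derived from any particular pair of norms.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm2(v):
    return math.sqrt(inner(v, v))

def project(v, basis):
    """Orthogonal projection of v onto span(basis); basis must be orthonormal."""
    out = [0.0] * len(v)
    for b in basis:
        coeff = inner(v, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

def theta(e, basis):
    """Cosine of the angle between e and the subspace spanned by basis."""
    return norm2(project(e, basis)) / norm2(e)

# Toy subspace: the first two coordinate axes of R^3.
R_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
e1 = [1.0, 1.0, 0.0]     # lies inside the subspace, so the cosine is 1
e2 = [1.0, 0.0, 1.0]     # makes a 45 degree angle with the subspace

thetas = [theta(e1, R_basis), theta(e2, R_basis)]
theta_min = min(thetas)  # worst case over the structures
```

The minimum over the structures is the quantity that enters the bounds: one sign vector far from the subspace is enough to make it small.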



2.2 Convex recovery program 

We will be investigating the recovery of the simultaneously structured $x_0$ from its linear measurements $\mathcal{G}(x_0)$.
To recover the signal $x_0$, we would like to simultaneously minimize the norms $\|\cdot\|_{(i)}$, $i = 1, \dots, \tau$, which leads
to a multi-objective (vector-valued) optimization problem. For all feasible points $x$ satisfying $\mathcal{G}(x) = \mathcal{G}(x_0)$
and $x \in \mathcal{C}$, consider the set of achievable norms $\{\|x\|_{(i)}\}_{i=1}^{\tau}$ denoted as points in $\mathbb{R}^\tau$, as shown in Figure 3.
Since the norms and the constraints are convex, the set of achievable values is also convex [38, Chapter 4].
The minimal points of this set, with respect to the partial order induced by the positive orthant $\mathbb{R}^\tau_+$, form
the Pareto-optimal front.

Definition 2.4 (Recoverability) We call $x_0$ recoverable if it is a Pareto optimal point; i.e., there does not
exist any feasible $x' \neq x_0$ satisfying $\mathcal{G}(x') = \mathcal{G}(x_0)$ and $x' \in \mathcal{C}$, with $\|x'\|_{(i)} \leq \|x_0\|_{(i)}$ for $i = 1, \dots, \tau$.
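
The recoverability condition amounts to a componentwise comparison of norm vectors. The sketch below illustrates this on a finite list of candidate points; in the actual problem the feasible set is a continuum, so this is only a toy check. The function names and the numeric norm values are hypothetical.

```python
def blocks_recovery(candidate_norms, target_norms):
    """True if the candidate is no larger than the target in every norm."""
    return all(c <= t for c, t in zip(candidate_norms, target_norms))

def is_recoverable(target_norms, other_feasible_norms):
    """The target is recoverable (in this toy sense) if no other feasible
    point is at least as good in all norms simultaneously."""
    return not any(blocks_recovery(c, target_norms) for c in other_feasible_norms)

# Norm vectors (||x||_(1), ||x||_(2)) of the target and of other feasible points.
target = (4.0, 3.0)
others_ok = [(3.5, 3.2), (4.5, 2.0), (5.0, 5.0)]   # each is worse in some norm
others_bad = others_ok + [(3.9, 3.0)]              # last one is no worse in both

recoverable_case = is_recoverable(target, others_ok)
blocked_case = is_recoverable(target, others_bad)
```

Each point in `others_ok` trades one norm against the other, so the target stays on the Pareto front; the extra point in `others_bad` is at least as good in both norms and removes it from the front.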

Our vector-valued convex recovery program can be turned into a scalar optimization problem as

$$\begin{array}{ll}
\underset{x \in \mathcal{C}}{\text{minimize}} & f(x) = h(\|x\|_{(1)}, \dots, \|x\|_{(\tau)}) \\
\text{subject to} & \mathcal{G}(x) = \mathcal{G}(x_0),
\end{array} \tag{2.7}$$

where $h : \mathbb{R}^\tau_+ \to \mathbb{R}_+$ is chosen to be increasing with respect to the order induced by $\mathbb{R}^\tau_+$. For convex
problems with strong duality, it is known that the only scalarizing function $h$ we need to produce all points $x_0$ on
the Pareto optimal front is the weighted sum $f(x) = \sum_{i=1}^{\tau} \lambda_i \|x\|_{(i)}$, where the $\lambda_i$ are positive weights that
can be obtained as the coefficients of a supporting hyperplane at $x_0$ (see, e.g., [38, Chapter 4]). Alternatively,
another scalar objective function that can be used is the weighted maximum

$$f(x) = \max_{i = 1, \dots, \tau} \frac{1}{\|x_0\|_{(i)}} \|x\|_{(i)}. \tag{2.8}$$

We sometimes prefer to use this function instead of the weighted sum, since the weights corresponding to $x_0$
are given explicitly. In particular, as discussed in Lemma 4.1, this function is the best choice to recover
$x_0$ via (2.7).
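
The weighted maximum in (2.8) can be evaluated directly. The sketch below is a toy illustration with two norms on $\mathbb{R}^3$ (the $\ell_1$ and $\ell_2$ norms, chosen only for convenience); the function names and the vectors are hypothetical. It shows that the objective equals one at $x_0$ and drops below one exactly when every norm is strictly smaller than at $x_0$.

```python
import math

def l1_norm(x):
    return sum(abs(t) for t in x)

def l2_norm(x):
    return math.sqrt(sum(t * t for t in x))

def weighted_max(x, x0, norms):
    """Scalarization f(x) = max_i ||x||_(i) / ||x0||_(i)."""
    return max(norm(x) / norm(x0) for norm in norms)

norms = [l1_norm, l2_norm]
x0 = [3.0, 0.0, -4.0]                                  # ||x0||_1 = 7, ||x0||_2 = 5

at_x0 = weighted_max(x0, x0, norms)                    # every ratio equals 1
smaller = weighted_max([1.0, 0.0, -1.0], x0, norms)    # both norms are smaller
tied = weighted_max([5.0, 0.0, 0.0], x0, norms)        # l2 norm ties, l1 is smaller
```

A point like the third one, which improves one norm but only ties the other, does not reduce the objective below its value at $x_0$.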

In Figure 3, consider the smallest $m$ that makes $x_0$ recoverable (i.e., whose corresponding achievable set
has $x_0$ as a Pareto optimal point). Then one can choose a function $h$ and recover $x_0$ by (2.7) using the $m$
measurements. If the number of measurements is any less, then no function can recover $x_0$. Our goal is to
provide lower bounds on $m$.




Figure 3: Nested sets of achievable values shrink as the number of measurements grows, and they all contain $x_0$. We
need at least $m$ measurements for $x_0$ to be recoverable, since for any $m' < m$, $x_0$ is not on the Pareto optimal front.

Throughout this paper, we consider Gaussian measurements, defined as follows.

Definition 2.5 (Gaussian measurement operator) $\mathcal{G}(\cdot) : \mathbb{R}^n \to \mathbb{R}^m$ is a Gaussian measurement
operator if $\mathcal{G}(x)$ is equivalent to the matrix multiplication $Gx$ where $G \in \mathbb{R}^{m \times n}$ has i.i.d. standard normal
entries.
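
A Gaussian measurement operator in the sense of Definition 2.5 is straightforward to simulate. The following sketch uses only the Python standard library; the function name `gaussian_operator`, the seed, and the sizes are hypothetical choices made for illustration.

```python
import random

def gaussian_operator(m, n, seed=None):
    """Return a function x -> G x, where G is m x n with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    G = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

    def apply(x):
        return [sum(g * xi for g, xi in zip(row, x)) for row in G]

    return apply

measure = gaussian_operator(m=3, n=5, seed=1)
x0 = [1.0, 0.0, -2.0, 0.0, 0.5]
y = measure(x0)                                   # m linear measurements of x0
y_scaled = measure([2.0 * v for v in x0])         # the operator is linear
```

Fixing the seed fixes the matrix $G$, so repeated calls to `measure` apply the same operator.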

Note that in [8], Chandrasekaran et al. propose a general theory for constructing a suitable penalty, called
an atomic norm, given a single set of atoms that describes the structure of the target object. In the case
of simultaneous structures, this construction requires defining new atoms, and then ensuring that the resulting
atomic norm can be minimized in a computationally tractable way, which is nontrivial.
We briefly discuss such constructions as a future research direction in Section 7.

3 Main Results: Theorem Statements 

In this section, we state our main theorems that aim to characterize the number of measurements needed to 
recover a simultaneously structured signal by convex or nonconvex programs. We first present our general
results, followed by results for simultaneously sparse and low-rank matrices as a specific but important 
instance of the general case. The proofs are given in Section 4. 

3.1 General simultaneously structured signals 

This section deals with the recovery of a signal $x_0$ that is simultaneously structured with $S_1, S_2, \dots, S_\tau$ as
described in Section 2.1. We give a lower bound on the required number of measurements that is determined
by the support $T_i$ with the smallest dimension. Before stating the theorem, we give a relevant definition
regarding the "size" of a cone.

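Definition 3.1 (Cone width) Let $M$ be a closed convex cone in $\mathbb{R}^n$ and let $v \in \mathbb{R}^n$ be a vector with i.i.d.
standard normal entries. Then, the normalized width of $M$ is defined as

$$\eta(M) = \frac{\mathbb{E}[\|\mathcal{P}_M(v)\|_2]}{\sqrt{n}}. \tag{3.1}$$

Note that we always have $\eta(M) = \frac{\mathbb{E}[\|\mathcal{P}_M(v)\|_2]}{\sqrt{n}} \leq \frac{\mathbb{E}[\|v\|_2]}{\sqrt{n}} \leq 1$.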
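
To get a feel for the normalized width, the sketch below estimates it by Monte Carlo for the nonnegative orthant, where projection onto the cone simply zeroes out negative entries. Everything here is an assumption made for illustration: the working definition $\eta(M) = \mathbb{E}\|\mathcal{P}_M(v)\|_2 / \sqrt{n}$, the choice of cone, the function name, the dimension, the number of trials, and the seed.

```python
import math
import random

def normalized_width_orthant(n, trials=2000, seed=0):
    """Monte Carlo estimate of E||P_M(v)||_2 / sqrt(n) for the nonnegative
    orthant M in R^n, with v having i.i.d. standard normal entries."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Projection onto the orthant keeps the positive entries only.
        total += math.sqrt(sum(t * t for t in v if t > 0))
    return total / trials / math.sqrt(n)

eta = normalized_width_orthant(n=50)
```

Under this working definition the estimate comes out close to $1/\sqrt{2} \approx 0.71$, since on average half of the squared norm of $v$ falls on the positive entries, and it stays below one.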

Theorem 3.1 Consider the programs in (2.7) using $m$ generic measurements. Let



. n \\Vn{^)\\l 
8l7^ gea/(xo) ||g||^ 

and 7 = i-))(c°) • Then, whenever m < niin{ "^^"g^*" -^^ mo}, Xq will not be a minimizer for any of the 
programs in (2.7) with probability at least 1 — Ci cxp(— C2 minjmo, n{\ — '7(C°))^}) for some positive constants 
Ci and C2- 

In addition, observe that for a smaller cone $\mathcal{C}$, $\eta(\mathcal{C}^\circ)$ will be larger. Hence, it is reasonable to expect a
smaller lower bound on the required number of measurements.

While this theorem is quite general, it is not very insightful, as computing the bound may not be easy. Next, we
make an assumption that allows us to simplify the lower bound and give a general result for the programs
in (2.7).

Assumption 1 For all $1 \leq i \neq j \leq \tau$, $\langle e_{\cap,i}, e_{\cap,j} \rangle \geq 0$, where $\{e_{\cap,i}\}_{i=1}^{\tau}$ is given in Definition 2.3.

By Lemma 2.1, the sign vectors $\{e_i\}_{i=1}^{\tau}$ have nonnegative inner products with $x_0$. Since $x_0 \in \mathcal{R}$, this also implies

$$\langle e_{\cap,i}, x_0 \rangle = \langle e_i, x_0 \rangle \geq 0 \quad \text{for } 1 \leq i \leq \tau. \tag{3.2}$$

Assumption 1 goes one step further and assumes pairwise nonnegative inner products between the $\{e_{\cap,i}\}_{i=1}^{\tau}$.
The following lemma, which is easy to show, gives a sufficient condition for Assumption 1 to hold.

Lemma 3.1 Assumption 1 holds if the angle between $x_0$ and $e_i$ is upper bounded by $\frac{\pi}{4}$ for all $1 \leq i \leq \tau$.

Angles between $x_0$ and the $e_i$'s are always upper bounded by $\frac{\pi}{2}$ due to Lemma 2.1, and when a stricter
condition holds on these angles, Assumption 1 also holds.

As discussed before, there are various options for the scalarizing function in (2.7), with one choice being
the weighted sum of norms. In fact, for a recoverable point $\mathbf{x}_0$ there always exists a weighted sum of norms
which recovers it. This function is also often the choice in applications, where the space of positive weights
is searched for a good combination. Thus, we can state the following theorem as a general result.

Theorem 3.2 Suppose Assumption 1 holds. Consider d-n^n = min{dim(Ti) : i = 1, . . . ,t} and let 7, ci,C2 
be same as in Theorem 3.1. Then, whenever m < min{mQ, liiiiiM^L^ | ^ xg will not be a minimizer of any of 
the recovery programs in (2.7) with probability at least 1 — ci exp(— C2 min{mQ,n(l — rj{C°))'^}), where 

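$$m_0' = \frac{\kappa\,\theta^2\,d_{\min}}{81\,\gamma^2\,\tau} - \tau,$$

and

$$\kappa = \min_{1 \le i \le \tau} \frac{n\,\|\mathbf{e}_i\|_2^2}{L_i^2\,\dim(T_i)} \quad \text{and} \quad \theta = \min_{1 \le i \le \tau} \frac{\|\mathbf{e}_{\cap,i}\|_2}{\|\mathbf{e}_i\|_2}$$

from Definitions 2.2 and 2.3 respectively.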

Observe that Theorem 3.2 is stronger than stating "a particular function $h(\|\mathbf{x}\|_{(1)}, \dots, \|\mathbf{x}\|_{(\tau)})$ will not
work". Instead, our result states that with high probability none of the programs in the class (2.7) can
return $\mathbf{x}_0$ as the optimal solution unless the number of measurements is sufficiently large.

To understand the result better, note that the required number of measurements is proportional to
$d_{\min}$, which is the dimension of the smallest support. As we have argued in Section 2.1, the dimension of the
support corresponds to how structured the signal is. For sparse signals it is equal to the sparsity, and for a
rank r matrix, it is equal to the degrees of freedom of the set of rank r matrices. Consequently, Theorem
3.2 suggests that even if the signal satisfies multiple structures, the required number of measurements is
effectively determined by only one dominant structure.
Intuitively, the degrees of freedom of a simultaneously structured signal should be much lower, on the
order of $\dim(T_\cap)$. Hence, there is a considerable gap between the expected number of measurements based on
model complexity ($\dim(T_\cap)$) and the number of measurements needed for recovery via (2.7) ($\min_i \dim(T_i)$).

Finally, as shown in Section 5, $\kappa$ and $\theta$ can be lower bounded by constants for the examples of norms we
consider. For the specific cones considered there, $\eta(\mathcal{C}^\circ)$, $\eta(\mathcal{C})$ and $\gamma$ are constants as well (see Appendix A),
and $\tau$ is generally a small positive integer. In these cases, the required number of measurements is directly
determined by $d_{\min}$ (see the next section).

3.2 Simultaneously Sparse and Low rank Matrices 

We now focus on a special case, namely simultaneously sparse and low-rank (S&L) matrices. We consider
matrices with nonzero entries contained in a small submatrix where the submatrix itself is low rank. Here, the
norms of interest are $\|\cdot\|_{1,2}$, $\|\cdot\|_1$ and $\|\cdot\|_*$, and the cone of interest is the PSD cone. We also consider
nonconvex approaches and contrast the results with convex approaches. For the nonconvex problem, we
replace the norms $\|\cdot\|_1$, $\|\cdot\|_{1,2}$, $\|\cdot\|_*$ with the functions $\|\cdot\|_0$, $\|\cdot\|_{0,2}$, $\mathrm{rank}(\cdot)$, which give the number
of nonzero entries, the number of nonzero columns and the rank of a matrix respectively, and use the same
cone constraint as the convex method. We show that convex methods perform poorly, as predicted by the
general result in Theorem 3.2, while nonconvex methods require the optimal number of measurements (up to a
logarithmic factor). Proofs are given in Section 5.

Definition 3.2 We say $\mathbf{X}_0 \in \mathbb{R}^{n_1 \times n_2}$ is an S&L matrix with $(k_1, k_2, r)$ if the smallest submatrix that contains
the nonzero entries of $\mathbf{X}_0$ has size $k_1 \times k_2$ and $\mathrm{rank}(\mathbf{X}_0) = r$. When $\mathbf{X}_0$ is symmetric, let $n = n_1 = n_2$ and
$k = k_1 = k_2$. We consider the following cases.

(a) General: $\mathbf{X}_0 \in \mathbb{R}^{n_1 \times n_2}$ is S&L with $(k_1, k_2, r)$.

(b) PSD, arbitrary rank: $\mathbf{X}_0 \in \mathbb{R}^{n \times n}$ is PSD and S&L with $(k, k, r)$.

(c) PSD, rank 1: $\mathbf{X}_0 = \mathbf{x}_0 \mathbf{x}_0^T$ where $\mathbf{x}_0 \in \mathbb{R}^n$ is k-sparse, so that $\mathbf{X}_0$ is PSD and S&L with $(k, k, 1)$.

We are interested in S&L matrices with $k_1 \ll n_1$, $k_2 \ll n_2$ so that the matrix is sparse, and $r \ll \min\{k_1, k_2\}$
so that the submatrix containing the nonzero entries is low rank. Recall from Section 2.2 that our goal is to
recover $\mathbf{X}_0$ from random Gaussian observations $\mathcal{G}(\mathbf{X}_0)$ via convex or nonconvex optimization programs. The
measurements can be equivalently written as $\mathbf{G}\,\mathrm{vec}(\mathbf{X}_0)$, where $\mathbf{G} \in \mathbb{R}^{m \times n_1 n_2}$ and $\mathrm{vec}(\mathbf{X}_0) \in \mathbb{R}^{n_1 n_2}$ denotes
the vector obtained by stacking the columns of $\mathbf{X}_0$.

Based on the results in Section 3.1, we obtain lower bounds on the number of measurements for convex
recovery. We additionally show that significantly fewer measurements are sufficient for nonconvex programs
to uniquely recover $\mathbf{X}_0$, thus proving a performance gap between convex and nonconvex approaches. The
following theorem summarizes the results.

Theorem 3.3 (Performance of S&L matrix recovery) Assume m < ^^J^, and consider recovering 
Xo G R"ix"2 

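$$\underset{\mathbf{X} \in \mathcal{C}}{\text{minimize}} \;\; f(\mathbf{X}) \quad \text{subject to} \quad \mathcal{G}(\mathbf{X}) = \mathcal{G}(\mathbf{X}_0). \tag{3.3}$$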

For the cases given in Definition 3.2, the following convex and nonconvex recovery results hold for some 
positive constants ci,C2. 

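(a) General model:

(a1) Let $f(\mathbf{X}) = \|\mathbf{X}\|_{1,2} + \lambda_1 \|\mathbf{X}^T\|_{1,2} + \lambda_2 \|\mathbf{X}\|_*$ and $\mathcal{C} = \mathbb{R}^{n_1 \times n_2}$. Then, (3.3) will fail to recover
$\mathbf{X}_0$ with probability $1 - \exp(-c_1 d)$ for all possible $\lambda_1, \lambda_2 \ge 0$ whenever $m < c_2 d$, where
$d = \min\{n_1 k_2, n_2 k_1, (n_1 + n_2) r\}$.

(a2) Let $f(\mathbf{X}) = \frac{\|\mathbf{X}\|_{0,2}}{\|\mathbf{X}_0\|_{0,2}} + \frac{\|\mathbf{X}^T\|_{0,2}}{\|\mathbf{X}_0^T\|_{0,2}} + \frac{\mathrm{rank}(\mathbf{X})}{\mathrm{rank}(\mathbf{X}_0)}$ and $\mathcal{C} = \mathbb{R}^{n_1 \times n_2}$. Then, (3.3) will uniquely recover $\mathbf{X}_0$ with
probability $1 - \exp(-c_1 m)$ whenever $m > c_2 \max\{(k_1 + k_2) r,\; k_1 \log\frac{n_1}{k_1},\; k_2 \log\frac{n_2}{k_2}\}$.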
Setting              | Nonconvex sufficient m              | Convex required m
---------------------+-------------------------------------+---------------------------
General model        | $O(\max\{rk, k \log\frac{n}{k}\})$  | $\Omega(rn)$
PSD, arbitrary rank  | $O(\max\{rk, k \log\frac{n}{k}\})$  | $\Omega(rn)$
PSD, rank 1          | $O(k \log\frac{n}{k})$              | $\Omega(\min\{k^2, n\})$

Table 2: Summary of recovery results for models in Definition 3.2, assuming $n_1 = n_2 = n$ and $k_1 = k_2 = k$. For
the PSD rank 1 case, we assume $\frac{\|\mathbf{x}_0\|_1}{\sqrt{k}\,\|\mathbf{x}_0\|_2}$ to be a constant. Nonconvex approaches are optimal up to a logarithmic
factor, while convex approaches perform poorly.

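(b) PSD, arbitrary rank:

(b1) Let $f(\mathbf{X}) = \|\mathbf{X}\|_{1,2} + \lambda \|\mathbf{X}\|_*$ and $\mathcal{C} = \mathbb{S}^n_+$. Then, (3.3) will fail to recover $\mathbf{X}_0$ with probability
$1 - \exp(-c_1 rn)$ for all possible $\lambda \ge 0$ whenever $m < c_2 rn$.

(b2) Let $f(\mathbf{X}) = \frac{2\|\mathbf{X}\|_{0,2}}{\|\mathbf{X}_0\|_{0,2}} + \frac{\mathrm{rank}(\mathbf{X})}{\mathrm{rank}(\mathbf{X}_0)}$ and $\mathcal{C} = \mathbb{S}^n$. Then, (3.3) will uniquely recover $\mathbf{X}_0$ with probability
$1 - \exp(-c_1 m)$ whenever $m > c_2 \max\{rk, k \log\frac{n}{k}\}$.

(c) PSD, rank 1:

(c1) Let $f(\mathbf{X}) = \|\mathbf{X}\|_1 + \lambda \|\mathbf{X}\|_*$ and $\mathcal{C} = \mathbb{S}^n_+$. Then, (3.3) will fail to recover $\mathbf{X}_0$ with probability
$1 - \exp(-c_1 d)$ for all possible $\lambda \ge 0$ whenever $m < c_2 d$, where $d = \frac{\|\mathbf{x}_0\|_1^2}{k\,\|\mathbf{x}_0\|_2^2} \min\{k^2, n\}$.

(c2) Let $f(\mathbf{X}) = \frac{\|\mathbf{X}\|_0}{\|\mathbf{X}_0\|_0} + \frac{\mathrm{rank}(\mathbf{X})}{\mathrm{rank}(\mathbf{X}_0)}$ and $\mathcal{C} = \mathbb{S}^n$. Then, (3.3) will uniquely recover $\mathbf{X}_0$ with probability
$1 - \exp(-c_1 m)$ whenever $m > c_2 k \log\frac{n}{k}$.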

The nonconvex programs require almost the same number of measurements as the degrees of freedom
(or number of parameters) of the underlying model. For instance, it is known that the degrees of freedom
of a rank r matrix of size $k_1 \times k_2$ is simply $r(k_1 + k_2 - r)$, which is $O((k_1 + k_2) r)$. Hence, the nonconvex
results are optimal up to a logarithmic factor. On the other hand, our results on the convex programs that
follow from Theorem 3.2 indicate that the required number of measurements is determined by the support
with the smallest dimension. For example, for $\mathbf{X}_0$ obeying the 'general model', the dimensions of the supports
for the norms $\|\mathbf{X}\|_{1,2}$, $\|\mathbf{X}^T\|_{1,2}$ and $\|\mathbf{X}\|_*$ at $\mathbf{X}_0$ are $n_1 k_2$, $n_2 k_1$ and $(n_1 + n_2 - r) r$ respectively, and we indeed
require $\Omega(\min\{n_1 k_2, n_2 k_1, (n_1 + n_2) r\})$ measurements. Table 2 provides a quick comparison of the results on
S&L. It can be seen that there is a meaningful gap: while nonconvex programs are successful with an order-wise
optimal number of measurements, convex programs need significantly more measurements. We observe a
similar gap for the special case of simultaneously sparse and rank-1 PSD matrices.
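
To make the size of this gap concrete, the following sketch evaluates the support dimensions $n_1 k_2$, $n_2 k_1$, $(n_1 + n_2 - r) r$ and the degrees of freedom $r(k_1 + k_2 - r)$ quoted above for one parameter choice. The function names and the numerical values are illustrative assumptions, and constants and logarithmic factors from the theorems are deliberately ignored.

```python
def support_dimensions(n1, n2, k1, k2, r):
    """Support dimensions for the three norms of the general S&L model."""
    return {
        "l12_columns": n1 * k2,         # ||X||_{1,2}: k2 nonzero columns
        "l12_rows": n2 * k1,            # ||X^T||_{1,2}: k1 nonzero rows
        "nuclear": (n1 + n2 - r) * r,   # ||X||_*: rank-r tangent space
    }


def degrees_of_freedom(k1, k2, r):
    """Degrees of freedom of a rank-r matrix of size k1 x k2."""
    return r * (k1 + k2 - r)


# Hypothetical example: a 1000 x 1000 matrix with a 20 x 20 rank-2 block.
dims = support_dimensions(n1=1000, n2=1000, k1=20, k2=20, r=2)
d_min = min(dims.values())
dof = degrees_of_freedom(k1=20, k2=20, r=2)
print(dims, d_min, dof)
```

In this example the smallest support dimension is 3996 while the model has only 76 degrees of freedom, which is the kind of separation between the convex lower bound and the nonconvex upper bound that Table 2 summarizes.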

While there is a gap between nonconvex and convex recovery, our performance bound on the mixed convex
optimization can be related to the performance of the best of the individual norms. This comes from the fact
that for many structure-inducing norms, the dimension of the support and the required number of measurements
for recovery are the same up to a logarithmic factor. For instance, for the "PSD, rank 1" model, nuclear
norm minimization alone would require $O(n)$ and the $\ell_1$ norm alone would require $O(k^2 \log\frac{n}{k})$ measurements, where
the dimensions of the corresponding supports are of order $n$ and $k^2$ respectively. Hence, the best of the
individual norms gives $O(\min\{n, k^2 \log\frac{n}{k}\})$, which is only logarithmically larger than the bound given by
Theorem 3.3, statement (c1). A similar result is true for the matrices obeying the "PSD, arbitrary rank"
model, when we consider the fact that minimizing only the $\ell_{1,2}$ norm would require $O(kn)$ ([36]) and only the
nuclear norm would require $O(rn)$ measurements for perfect recovery. Hence the best individual norm
requires $O(\min\{kn, rn\}) = O(rn)$, which is the same as the bound given in statement (b1). Overall, mixed
norms are not able to significantly outperform the best of the individual norms.
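
The comparison for the "PSD, rank 1" model can be checked numerically. The sketch below evaluates the order-wise expressions $\min\{k^2, n\}$ (the lower bound for the mixed program) and $\min\{n, k^2 \log\frac{n}{k}\}$ (the best individual norm) with all constants set to one, which is an assumption made purely for illustration; the function names are hypothetical.

```python
import math


def mixed_lower_bound(n, k):
    """Order of the lower bound for the mixed convex program (constants dropped)."""
    return min(k * k, n)


def best_individual_norm(n, k):
    """Order of the measurements needed by the better single norm (constants dropped)."""
    return min(n, k * k * math.log(n / k))


n, k = 10000, 8
lower = mixed_lower_bound(n, k)
single = best_individual_norm(n, k)
ratio = single / lower
print(lower, round(single, 1), round(ratio, 2))
```

With these values the ratio between the two expressions is at most $\log\frac{n}{k}$, consistent with the statement that the mixed program cannot beat the best individual norm by more than a logarithmic factor.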

Finally, as we saw in Section 3.1, adding a cone constraint to the recovery program does not help in
reducing the lower bound by more than a constant factor. In particular, we discuss the positive semidefiniteness
assumption, and show that in the sparse phase retrieval problem, the number of measurements remains
high even when we include this extra information. On the other hand, the nonconvex recovery programs
perform well even without the PSD constraint.
4 General Simultaneously Structured Model Recovery

Recall the setup from Section 2. As discussed in Section 2.1, we consider a vector $\mathbf{x}_0 \in \mathbb{R}^n$ at which a family
of norms $\{\|\cdot\|_{(i)}\}_{i=1}^{\tau}$ are decomposable, and $\mathbf{x}_0$ satisfies the cone constraint $\mathbf{x}_0 \in \mathcal{C}$. Recall that the supports
corresponding to the norms are given by $\{T_i\}_{i=1}^{\tau}$ and $T_\cap = \bigcap_{i=1}^{\tau} T_i$. This section is dedicated to the proofs
of the theorems in Section 3.1 and additional side results, where the goal is to find lower bounds on the required
number of measurements to recover $\mathbf{x}_0$. By the discussion in Section 2.2, we can focus on the point-wise weighted
summation of norms, and the same lower bounds will work for any function in (2.7).

4.1 Preliminary Lemmas 

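We first show that, to recover $\mathbf{x}_0$, the objective function $\max_{1 \le i \le \tau} \frac{\|\mathbf{x}\|_{(i)}}{\|\mathbf{x}_0\|_{(i)}}$ can be viewed as the 'best' among
the functions mentioned in (2.7).

Lemma 4.1 Consider the class of recovery programs in (2.7). If the program

$$\underset{\mathbf{x} \in \mathcal{C}}{\text{minimize}} \;\; f_{\text{best}}(\mathbf{x}) \triangleq \max_{i=1,\dots,\tau} \frac{\|\mathbf{x}\|_{(i)}}{\|\mathbf{x}_0\|_{(i)}} \quad \text{subject to} \quad \mathcal{G}(\mathbf{x}) = \mathcal{G}(\mathbf{x}_0) \tag{4.1}$$

fails to recover $\mathbf{x}_0$, then any member of this class will also fail to recover $\mathbf{x}_0$.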

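Proof. Suppose (4.1) does not have $\mathbf{x}_0$ as an optimal solution and there exists $\mathbf{x}'$ such that $f_{\text{best}}(\mathbf{x}') \le f_{\text{best}}(\mathbf{x}_0)$; then

$$\frac{\|\mathbf{x}'\|_{(i)}}{\|\mathbf{x}_0\|_{(i)}} \le f_{\text{best}}(\mathbf{x}') \le f_{\text{best}}(\mathbf{x}_0) = 1, \quad \text{for } i = 1, \dots, \tau,$$

which implies

$$\|\mathbf{x}'\|_{(i)} \le \|\mathbf{x}_0\|_{(i)}, \quad \text{for all } i = 1, \dots, \tau. \tag{4.2}$$

Conversely, given (4.2), we have $f_{\text{best}}(\mathbf{x}') \le f_{\text{best}}(\mathbf{x}_0)$ from the definition of $f_{\text{best}}$.

Furthermore, since we assume $h(\cdot)$ in (2.7) is non-decreasing in its arguments and increasing in at least
one of them, (4.2) implies $f(\mathbf{x}') \le f(\mathbf{x}_0)$ for any such function $f(\cdot)$. Thus, failure of $f_{\text{best}}(\cdot)$ in recovery of
$\mathbf{x}_0$ implies failure of any other function in (2.7) in this task. ■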
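
The argument of Lemma 4.1 can be illustrated on a toy example. The sketch below takes the $\ell_1$ and $\ell_2$ norms as the two norms (an assumption for illustration; the paper's norms are those tied to the structures of Section 2.1), evaluates $f_{\text{best}}$ at a hypothetical competitor $\mathbf{x}'$, and checks that whenever $f_{\text{best}}(\mathbf{x}') < 1$ every weighted sum of the norms is also smaller at $\mathbf{x}'$ than at $\mathbf{x}_0$. Feasibility with respect to the measurements is not modeled here.

```python
import math


def l1(x):
    return sum(abs(t) for t in x)


def l2(x):
    return math.sqrt(sum(t * t for t in x))


def f_best(x, x0, norms):
    """Maximum over the norms of ||x||_(i) / ||x0||_(i)."""
    return max(norm(x) / norm(x0) for norm in norms)


def weighted_sum(x, weights, norms):
    """One member of the class (2.7): a nonnegative weighted sum of norms."""
    return sum(w * norm(x) for w, norm in zip(weights, norms))


norms = [l1, l2]
x0 = [3.0, 0.0, 4.0, 0.0]      # hypothetical true signal
x_alt = [2.0, 1.0, 3.0, 0.0]   # hypothetical competitor with smaller norms

print(f_best(x0, x0, norms), f_best(x_alt, x0, norms))
for weights in [(1.0, 0.0), (0.0, 1.0), (1.0, 2.5), (0.3, 0.7)]:
    print(weights, weighted_sum(x_alt, weights, norms) < weighted_sum(x0, weights, norms))
```

Since every norm of the competitor is strictly smaller than the corresponding norm of $\mathbf{x}_0$ in this example, each choice of weights that is not identically zero prefers the competitor, mirroring the monotonicity step of the proof.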

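The following lemma gives necessary conditions for $\mathbf{x}_0$ to be a minimizer of the problem (2.7).

Lemma 4.2 Let $\mathcal{G}^*$ denote the adjoint of the linear map $\mathcal{G}$. If $\mathbf{x}_0$ is a minimizer of the program (2.7), then
there exist $\mathbf{v} \in \mathcal{C}^*$, $\mathbf{z}$, and $\mathbf{g} \in \partial f(\mathbf{x}_0)$ such that

$$\mathbf{g} - \mathbf{v} - \mathcal{G}^*(\mathbf{z}) = 0 \quad \text{and} \quad \langle \mathbf{x}_0, \mathbf{v} \rangle = 0.$$

The proof of Lemma 4.2 follows from the KKT conditions for (2.7) to have $\mathbf{x}_0$ as an optimal solution [39,
Section 4.7].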

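The next lemma describes the subdifferential of any general function $f(\mathbf{x}) = h(\|\mathbf{x}\|_{(1)}, \dots, \|\mathbf{x}\|_{(\tau)})$ as
discussed in Section 2.2.

Lemma 4.3 For any subgradient $\mathbf{g}$ of the function $f(\mathbf{x}) = h(\|\mathbf{x}\|_{(1)}, \dots, \|\mathbf{x}\|_{(\tau)})$ at $\mathbf{x} \ne 0$ defined by a convex
function $h(\cdot)$, there exist non-negative constants $w_i$, $i = 1, \dots, \tau$ such that

$$\mathbf{g} = \sum_{i=1}^{\tau} w_i \mathbf{g}_i,$$

where $\mathbf{g}_i \in \partial \|\mathbf{x}\|_{(i)}$.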
Proof. Consider the function $N(\mathbf{x}) = [\|\mathbf{x}\|_{(1)}, \dots, \|\mathbf{x}\|_{(\tau)}]^T$, by which we have $f(\mathbf{x}) = h(N(\mathbf{x}))$. By
Theorem 10.49 in [40] we have

$$\partial f(\mathbf{x}) = \bigcup \left\{ \partial(\mathbf{y}^T N(\mathbf{x})) : \mathbf{y} \in \partial h(N(\mathbf{x})) \right\},$$

where we used the convexity of $f$ and $h$. Now notice that any $\mathbf{y} \in \partial h(N(\mathbf{x}))$ is a non-negative vector because
of the monotonicity assumption on $h(\cdot)$. This implies that any subgradient $\mathbf{g} \in \partial f(\mathbf{x})$ is of the form
$\partial(\mathbf{w}^T N(\mathbf{x}))$ for some nonnegative vector $\mathbf{w}$. The desired result simply follows because subgradients of a conic
combination of norms are conic combinations of their subgradients (see e.g. [41]). ■
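
A quick numerical check of the last step can be written as follows. The sketch takes $f(\mathbf{x}) = w_1 \|\mathbf{x}\|_1 + w_2 \|\mathbf{x}\|_2$ as a stand-in for a conic combination of norms (the specific norms, weights and test point are assumptions for illustration), forms the vector $w_1 \mathbf{g}_1 + w_2 \mathbf{g}_2$ from one subgradient of each norm, and verifies the subgradient inequality $f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \mathbf{g}, \mathbf{y} - \mathbf{x} \rangle$ at random points.

```python
import math
import random


def sign(t):
    return (t > 0) - (t < 0)


def l2(x):
    return math.sqrt(sum(t * t for t in x))


def f(x, w1, w2):
    """Conic combination of the l1 and l2 norms."""
    return w1 * sum(abs(t) for t in x) + w2 * l2(x)


def combined_subgradient(x, w1, w2):
    """w1 * g1 + w2 * g2 with g1 = sign(x) in d||x||_1 and g2 = x/||x||_2."""
    nrm = l2(x)  # requires x != 0
    return [w1 * sign(t) + w2 * t / nrm for t in x]


def subgradient_gap(x, y, w1, w2):
    """f(y) - f(x) - <g, y - x>; nonnegative if g is a subgradient at x."""
    g = combined_subgradient(x, w1, w2)
    inner = sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x))
    return f(y, w1, w2) - f(x, w1, w2) - inner


rng = random.Random(0)
x = [1.5, 0.0, -2.0, 0.0, 0.5]
gaps = [subgradient_gap(x, [rng.gauss(0.0, 3.0) for _ in x], 0.7, 1.3)
        for _ in range(1000)]
print(min(gaps))
```

The smallest gap over the random test points stays nonnegative (up to rounding), as expected when the combined vector is a valid subgradient.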

Using Lemmas 4.2 and 4.3, we now provide the proofs of Theorems 3.1 and 3.2. 
4.2 Proof of Theorem 3.1 

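Suppose $\mathbf{x}_0$ is a minimizer of (2.7). From Lemma 4.2, there exist $\mathbf{g} \in \partial f(\mathbf{x}_0)$, $\mathbf{z} \in \mathbb{R}^m$ and $\mathbf{v} \in \mathcal{C}^*$ such
that

$$\mathbf{g} = \mathcal{G}^*(\mathbf{z}) + \mathbf{v} \tag{4.3}$$

and $\langle \mathbf{x}_0, \mathbf{v} \rangle = 0$. To use the spectral properties of the random Gaussian map $\mathcal{G}$, we will eliminate the contribution
of $\mathbf{v}$ in equation (4.3). Recalling (2.6), observe that $\mathcal{P}_{\mathcal{R}}(\mathbf{v}) = 0$ as $\mathbf{v} \in \mathcal{R}^{\perp}$ by definition. Now, projecting
both sides of (4.3) onto the subspace $\mathcal{R}$ gives

$$\mathcal{P}_{\mathcal{R}}(\mathbf{g}) = \mathcal{P}_{\mathcal{R}}(\mathcal{G}^*(\mathbf{z})). \tag{4.4}$$

From Lemma 4.3, $\mathcal{P}_{\mathcal{R}}(\mathbf{g})$ lies in the span of $\{\mathbf{e}_{\cap,i}\}_{i=1}^{\tau}$, which is a $\tau$-dimensional subspace. On the other
hand, $\mathrm{range}(\mathcal{P}_{\mathcal{R}}(\mathcal{G}^*))$ is an m-dimensional subspace chosen uniformly at random in $\mathcal{R}$. Hence, whenever
$m < \dim(\mathcal{R}) - \tau$, these subspaces have trivial intersection with probability 1, which implies that there does
not exist a $\mathbf{z}$ satisfying (4.4). Hence, for recovery of $\mathbf{x}_0$, we need $m \ge \dim(\mathcal{R}) - \tau$.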

Let us call m' = min{mo, liilnM^L^ | , jf dim (7?.) — t > m', we can already conclude with the desired 
result. Otherwise, we assume $m' \ge \dim(\mathcal{R}) - \tau$. Furthermore, it is safe to prove the result for $m = m'$, as
fewer measurements can only increase the chance of failure. Next, we consider three events, each of which
holds with high probability. Below, $c_1'$, $c_2'$ are the proper corresponding constants for each case.

• Using Theorem A. 3, since m < IIl^il^LJlli ^ with probability at least 1 — c'l exp(— 02(1 — ii{C°Yf"n), for 
all z G M™, we have 

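$$\|\mathcal{G}^*(\mathbf{z})\|_2 \le \gamma \, \|\mathcal{P}_{\mathcal{C}}(\mathcal{G}^*(\mathbf{z}))\|_2 \tag{4.5}$$

• Observe that $\mathcal{G}^*$ is equivalent to an $n \times m$ matrix with i.i.d. standard normal entries and, similarly,
$\mathcal{P}_{\mathcal{R}}(\mathcal{G}^*)$ is equivalent to a $\dim(\mathcal{R}) \times m$ matrix with i.i.d. standard normal entries after a proper unitary
transformation. Using Corollary 5.35 in [50] and choosing $t = \sqrt{m'}$, with probability $1 - 4\exp(-\frac{m'}{2})$,
we have

$$\sigma_{\min}(\mathcal{G}^*) \ge \sqrt{n} - \sqrt{m} - \sqrt{m'} \ge \frac{\sqrt{n}}{3} \tag{4.6}$$

$$\sigma_{\max}(\mathcal{P}_{\mathcal{R}}(\mathcal{G}^*)) \le \sqrt{\dim(\mathcal{R})} + \sqrt{m} + \sqrt{m'} \le 3\sqrt{m + \tau} \tag{4.7}$$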

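Overall, inequalities (4.5), (4.6) and (4.7) hold with probability at least $1 - c_1 \exp(-c_2 \min\{m_0, n(1 - \eta(\mathcal{C}^\circ))^2\})$.
Assuming they hold, we will show that multipliers $\mathbf{g}, \mathbf{v}, \mathbf{z}$ in (4.3) cannot exist. To show this
by contradiction, assume these inequalities hold and such $\mathbf{g}, \mathbf{v}, \mathbf{z}$ exist.

Since $\mathbf{v} \in \mathcal{C}^*$, from Lemma A.1 we have $\mathcal{P}_{\mathcal{C}}(-\mathbf{v}) = \mathcal{P}_{\mathcal{C}}(\mathcal{G}^*(\mathbf{z}) - \mathbf{g}) = 0$. Using Corollary A.1,

$$\|\mathbf{g}\|_2 \ge \|\mathcal{P}_{\mathcal{C}}(\mathcal{G}^*(\mathbf{z}))\|_2. \tag{4.8}$$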
Now, combining (4.4), (4.5) and (4.8), we have



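Using (4.6) and (4.7), we can rewrite (4.9) as

$$\frac{\|\mathcal{P}_{\mathcal{R}}(\mathbf{g})\|_2}{3\sqrt{m + \tau}} \le \|\mathbf{z}\|_2 \le \frac{3\gamma \|\mathbf{g}\|_2}{\sqrt{n}},$$

which gives

$$m_0 > n \left( \frac{\|\mathcal{P}_{\mathcal{R}}(\mathbf{g})\|_2}{9\gamma \|\mathbf{g}\|_2} \right)^2 - \tau \ge \min_{\mathbf{g} \in \partial f(\mathbf{x}_0)} n \left( \frac{\|\mathcal{P}_{\mathcal{R}}(\mathbf{g})\|_2}{9\gamma \|\mathbf{g}\|_2} \right)^2 - \tau = m_0,$$

which is a contradiction.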



4.3 Proof of Theorem 3.2 

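The result follows immediately from Theorem 3.1. We first find a lower bound on $m_0$. From Lemma 4.3,
any $\mathbf{g} \in \partial f(\mathbf{x}_0)$ can be written as $\mathbf{g} = \sum_{i=1}^{\tau} w_i \mathbf{g}_i$ for some non-negative coefficients $w_i$. Now, using Lemma
B.1, we can bound the subgradient as

$$\|\mathbf{g}\|_2 = \Big\| \sum_{i=1}^{\tau} w_i \mathbf{g}_i \Big\|_2 \le \sum_{i=1}^{\tau} w_i \|\mathbf{g}_i\|_2 \le \sum_{i=1}^{\tau} w_i L_i,$$

where $L_i$ is the Lipschitz constant of the norm $\|\cdot\|_{(i)}$. Next, from the definition of $\kappa$ we have

$$\Big( \sum_{i=1}^{\tau} w_i L_i \Big)^2 \le \tau \sum_{i=1}^{\tau} w_i^2 L_i^2 \le \frac{\tau n}{\kappa} \sum_{i=1}^{\tau} \frac{w_i^2 \|\mathbf{e}_i\|_2^2}{\dim(T_i)} \le \frac{\tau n}{\kappa\, d_{\min}} \sum_{i=1}^{\tau} w_i^2 \|\mathbf{e}_i\|_2^2,$$

where the leftmost inequality follows from the Cauchy-Schwarz inequality. Overall, we find

$$\frac{\sum_{i=1}^{\tau} w_i^2 \|\mathbf{e}_i\|_2^2}{\|\mathbf{g}\|_2^2} \ge \frac{\kappa\, d_{\min}}{n \tau}. \tag{4.10}$$

We will now estimate $\|\mathcal{P}_{\mathcal{R}}(\mathbf{g})\|_2$. First, we have

$$\mathcal{P}_{\mathcal{R}}(\mathbf{g}) = \sum_{i=1}^{\tau} w_i \mathcal{P}_{\mathcal{R}}(\mathbf{g}_i) = \sum_{i=1}^{\tau} w_i \mathbf{e}_{\cap,i},$$

which by using Assumption 1 gives

$$\|\mathcal{P}_{\mathcal{R}}(\mathbf{g})\|_2^2 = \Big\| \sum_{i=1}^{\tau} w_i \mathbf{e}_{\cap,i} \Big\|_2^2 \ge \sum_{i=1}^{\tau} w_i^2 \|\mathbf{e}_{\cap,i}\|_2^2 \ge \theta^2 \sum_{i=1}^{\tau} w_i^2 \|\mathbf{e}_i\|_2^2. \tag{4.11}$$

Combining (4.10) and (4.11), we find

$$n \left( \frac{\|\mathcal{P}_{\mathcal{R}}(\mathbf{g})\|_2}{9\gamma \|\mathbf{g}\|_2} \right)^2 \ge \frac{\kappa\, \theta^2}{81 \gamma^2 \tau}\, d_{\min} = m_0' + \tau. \tag{4.12}$$

The last inequality is true for all $\mathbf{g} \in \partial f(\mathbf{x}_0)$; hence,

$$m_0 + \tau = \min_{\mathbf{g} \in \partial f(\mathbf{x}_0)} n \left( \frac{\|\mathcal{P}_{\mathcal{R}}(\mathbf{g})\|_2}{9\gamma \|\mathbf{g}\|_2} \right)^2 \ge m_0' + \tau.$$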

Theorem 3.2 follows immediately from Theorem 3.1 as any m < min{mQ, liiin^i^L^} satisfy m < 
min{mo, Siln^i^L^j well. 
5 Simultaneously Sparse and Low-rank Matrices

Using the general framework provided in Section 3.1, in this section we present the proof of Theorem 3.3,
which states various convex and nonconvex recovery results for the S&L models. We divide the proof into two
parts, namely proofs for convex recovery and nonconvex recovery. We begin with convex recovery results.

5.1 Convex recovery results for S&L

In this section, we prove the statements of Theorem 3.3 regarding convex approaches, using Theorem 3.2.
We begin with the following lemma, which gives results on sign vectors and supports for the S&L model.
The proof is provided in Appendix B.

Lemma 5.1 Denote the norm f{X) ^ WX'^W 1^2 for all X e M."^"""'' 6?/ II -^11 1,2- Given a matrix Xq e W^^"""-' , 
let E*,Ec,Er,Ei and Tj,,Tc,Tr,Ti be the sign vectors and supports for the norms \\ ■ ||^, || • ||i^2, || ||i,2; 
II • 111 respectively. Then, 

• E,,E,,EeeT,nrcnr,, 

• {E^,Er) > 0, (E*,Ec) > 0, and (Ec,Er) > 0. 

Now, assume Xq = cruv'^ is a rank one matrix, where u G and v £ R"^ are ki and k2 sparse unit length 
vectors and a = ||Xo||f- W^e have 

. E. eT.nTi and||7'T,nTi(Ei)||f >max{Mi,Mi}||Ei||^, 

• (7'T.nTi(Ei),E,) > 0. 

5.1.1 Proof of Theorem 3.3: Convex cases 

Proof of (a1) We use the functions $\|\cdot\|_{1,2}$, $\|\cdot^T\|_{1,2}$ and $\|\cdot\|_*$ without the cone constraint, i.e., $\mathcal{C} = \mathbb{R}^{n_1 \times n_2}$.
Following the notation of Lemma 5.1, $T_\cap = T_* \cap T_c \cap T_r$ and all the sign vectors lie on $\mathcal{R} = T_\cap$, which means
$\theta = 1$, and they have pairwise nonnegative inner products. Also, $\dim(T_c) = k_2 n_1$, $\dim(T_r) = k_1 n_2$ and
$\dim(T_*) = (n_1 + n_2 - r) r$. Hence $d_{\min} = \min\{(n_1 + n_2 - r) r, n_1 k_2, n_2 k_1\}$. Furthermore, from Lemma 2.3,
$\kappa \ge \frac{1}{2}$. Applying Theorem 3.2 gives the result.

Proof of (bl) In this case, we apply Lemma B.3. We have TZ = Tr, D S", the norms are the same as in 
the general model, and 9 > Also, pairwise inner products are positive, dmin = min{(2n — r)r,kn} and 

K > i. Based on Corollary A. 2, for the PSD cone we have 7 < 11. The result follows from Theorem 3.2. 

Proof of (cl) In this case, the norms are || ■ ||i and || • ||*. From Lemma B.3, 7^ = Ti n T* n §" and 
9 > ^ir^TT- Similar to (bl), 7 < 11. Also, dmin = min{A:^,n}. The result follows from Theorem 3.2. 

1 1 Xf) 1 1 2 V 

5.2 Nonconvex recovery results for S&L 

We first state a lemma that will be useful in proving the nonconvex results. The proof is provided in
Appendix C and uses standard arguments.

Lemma 5.2 Consider the set of matrices $S$ in $\mathbb{R}^{n_1 \times n_2}$ that are supported over a $d_1 \times d_2$ submatrix with rank
at most $s$. There exists a constant $c > 0$ such that whenever $m \ge c \max\{(d_1 + d_2) s,\; d_1 \log\frac{n_1}{d_1},\; d_2 \log\frac{n_2}{d_2}\}$, with
probability $1 - 2\exp(-cm)$ an i.i.d. Gaussian operator $\mathcal{G}$ will satisfy the following:

$$\mathcal{G}(\mathbf{X}) \ne 0, \quad \text{for all } \mathbf{X} \in S. \tag{5.1}$$
5.2.1 Proof of Theorem 3.3: Nonconvex cases

Denote the sphere in $\mathbb{R}^{n_1 \times n_2}$ with unit Frobenius norm by $B$.

Proof of (a2) Observe that the function $f(\mathbf{X}) = \frac{\|\mathbf{X}\|_{0,2}}{\|\mathbf{X}_0\|_{0,2}} + \frac{\|\mathbf{X}^T\|_{0,2}}{\|\mathbf{X}_0^T\|_{0,2}} + \frac{\mathrm{rank}(\mathbf{X})}{\mathrm{rank}(\mathbf{X}_0)}$ satisfies the triangle inequality
and we have $f(\mathbf{X}_0) = 3$. Hence, if all nonzero null space elements $\mathbf{W} \in \mathcal{N}(\mathcal{G})$ satisfy $f(\mathbf{W}) > 6$, we have

$$f(\mathbf{X}) \ge f(\mathbf{X} - \mathbf{X}_0) - f(-\mathbf{X}_0) > 3,$$

for all feasible $\mathbf{X} \ne \mathbf{X}_0$, which implies that $\mathbf{X}_0$ is the unique minimizer.

Consider the set $S$ of matrices which are supported over a $6k_1 \times 6k_2$ submatrix with rank at most
$6r$. Observe that any $\mathbf{Z}$ satisfying $f(\mathbf{Z}) \le 6$ belongs to $S$. Hence ensuring $\mathcal{N}(\mathcal{G}) \cap S = \{0\}$ would ensure
$f(\mathbf{W}) > 6$ for all nonzero $\mathbf{W} \in \mathcal{N}(\mathcal{G})$. Since $S$ is a cone, this is equivalent to $\mathcal{N}(\mathcal{G}) \cap (B \cap S) = \emptyset$. Now, applying
Lemma 5.2 with the set $S$ and $d_1 = 6k_1$, $d_2 = 6k_2$, $s = 6r$, we find the desired result.
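
The quantities used in this argument are easy to evaluate on small examples. The sketch below assumes the objective of (a2) is $f(\mathbf{X}) = \frac{\|\mathbf{X}\|_{0,2}}{\|\mathbf{X}_0\|_{0,2}} + \frac{\|\mathbf{X}^T\|_{0,2}}{\|\mathbf{X}_0^T\|_{0,2}} + \frac{\mathrm{rank}(\mathbf{X})}{\mathrm{rank}(\mathbf{X}_0)}$, counts nonzero columns and rows, and computes the rank by exact Gaussian elimination. The helper names and the two test matrices are hypothetical; the code only illustrates $f(\mathbf{X}_0) = 3$ and the triangle-inequality step, and is not a recovery algorithm.

```python
from fractions import Fraction


def nonzero_columns(X):
    """Number of columns with at least one nonzero entry (the ||.||_{0,2} count)."""
    return sum(1 for col in zip(*X) if any(v != 0 for v in col))


def transpose(X):
    return [list(col) for col in zip(*X)]


def rank(X):
    """Exact rank via Gaussian elimination over rationals."""
    rows = [[Fraction(v) for v in row] for row in X]
    rk = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk


def f_nonconvex(X, X0):
    """Assumed objective of case (a2), normalized by the values at X0."""
    return (Fraction(nonzero_columns(X), nonzero_columns(X0))
            + Fraction(nonzero_columns(transpose(X)), nonzero_columns(transpose(X0)))
            + Fraction(rank(X), rank(X0)))


# Hypothetical S&L matrix: a rank-1, 2 x 2 block inside a 4 x 4 matrix.
X0 = [[1, 2, 0, 0], [2, 4, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]   # a dense perturbation
X = [[a + b for a, b in zip(r0, rw)] for r0, rw in zip(X0, W)]

print(f_nonconvex(X0, X0), f_nonconvex(W, X0), f_nonconvex(X, X0))
```

Here $f(\mathbf{X}_0) = 3$, the perturbation has $f(\mathbf{W}) = 8 > 6$, and the perturbed matrix satisfies $f(\mathbf{X}) \ge f(\mathbf{W}) - 3 > 3$, in line with the chain of inequalities in the proof.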

Proof of (b2) Observe that due to the symmetry constraint,

$$f(\mathbf{X}) = \frac{\|\mathbf{X}\|_{0,2}}{\|\mathbf{X}_0\|_{0,2}} + \frac{\|\mathbf{X}^T\|_{0,2}}{\|\mathbf{X}_0^T\|_{0,2}} + \frac{\mathrm{rank}(\mathbf{X})}{\mathrm{rank}(\mathbf{X}_0)}.$$

Hence, the minimization is the same as (a2): the matrix is rank $r$ and contained in a $k \times k$ submatrix, and
we additionally have the positive semidefinite constraint, which can only reduce the number of required
measurements compared to (a2). Consequently, the result follows by applying Lemma 5.2, similar to (a2).

Proof of (c2) Let $C = \{\mathbf{X} \ne 0 \mid f(\mathbf{X}) \le f(\mathbf{X}_0)\}$. Since $\mathrm{rank}(\mathbf{X}_0) = 1$, if $f(\mathbf{X}) \le f(\mathbf{X}_0) = 2$, then $\mathrm{rank}(\mathbf{X}) = 1$.
With the symmetry constraint, this means $\mathbf{X} = \pm \mathbf{x}\mathbf{x}^T$ for some $l$-sparse $\mathbf{x}$. Observe that $\mathbf{X} - \mathbf{X}_0$ has rank
at most 2 and is contained in a $2k \times 2k$ submatrix, as $l \le k$. Let $S$ be the set of matrices that are symmetric
and whose support lies in a $2k \times 2k$ submatrix. Using Lemma 5.2 with $s = 2$, $d_1 = d_2 = 2k$, whenever
$m \ge c k \log\frac{n}{k}$, with the desired probability all nonzero $\mathbf{W} \in S$ will satisfy $\mathcal{G}(\mathbf{W}) \ne 0$. Consequently, any $\mathbf{X} \in C$
will have $\mathcal{G}(\mathbf{X}) \ne \mathcal{G}(\mathbf{X}_0)$, hence $\mathbf{X}_0$ will be the unique minimizer.



6 Numerical Experiments 

In this section, we numerically verify our theoretical bounds on the number of measurements for the sparse
and low-rank recovery problem. We demonstrate the empirical performance of the weighted maximum of
the norms $f_{\text{best}}$ (see Lemma 4.1), as well as the weighted sum of norms.

The experimental setup is as follows. Our goal is to explore how the number of required measurements $m$
(or $m/\theta^2$ in the second set of experiments) scales with the size of the matrix $n$. We consider a grid of $(m, n)$
values, and generate at least 100 test instances for each grid point (in the boundary areas, we increase the
number of instances to at least 200). We generate the target matrix $\mathbf{X}_0$ by generating a $k \times r$ i.i.d. Gaussian
matrix $\mathbf{G}$, and inserting the $k \times k$ matrix $\mathbf{G}\mathbf{G}^T$ in an $n \times n$ matrix of zeros. We take $r = 1$ and $k = 8$ in
all of the following experiments; even with these small values, we can observe the scaling predicted by our
bounds. In each test, we measure the normalized recovery error $\frac{\|\hat{\mathbf{X}} - \mathbf{X}_0\|_F}{\|\mathbf{X}_0\|_F}$ and declare successful recovery
when this error is less than $10^{-4}$. The optimization programs are solved using the CVX package [42], which
calls the SDP solver SeDuMi [43].
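
The data-generation part of this setup can be sketched in a few lines. The code below builds $\mathbf{X}_0$ by placing $\mathbf{G}\mathbf{G}^T$ in the top-left corner of an $n \times n$ zero matrix (the position of the block is an assumption, since the text does not specify it), draws Gaussian measurement matrices, and evaluates the normalized Frobenius error. It uses only the Python standard library, does not solve the optimization problem, and all function names are hypothetical rather than taken from the authors' CVX code.

```python
import math
import random


def make_target(n, k, r, rng):
    """n x n matrix with the k x k block G G^T (G is k x r Gaussian) in one corner."""
    G = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(k)]
    X0 = [[0.0] * n for _ in range(n)]
    for i in range(k):
        for j in range(k):
            X0[i][j] = sum(G[i][t] * G[j][t] for t in range(r))
    return X0


def frobenius(X):
    return math.sqrt(sum(v * v for row in X for v in row))


def normalized_error(X_hat, X0):
    """||X_hat - X0||_F / ||X0||_F, the success criterion of the experiments."""
    diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(X_hat, X0)]
    return frobenius(diff) / frobenius(X0)


def gaussian_measurements(X, m, rng):
    """m inner products of X with i.i.d. standard Gaussian matrices."""
    n_rows, n_cols = len(X), len(X[0])
    return [sum(rng.gauss(0.0, 1.0) * X[i][j]
                for i in range(n_rows) for j in range(n_cols))
            for _ in range(m)]


rng = random.Random(0)
X0 = make_target(n=20, k=8, r=1, rng=rng)
y = gaussian_measurements(X0, m=30, rng=rng)
print(len(y), normalized_error(X0, X0))
```

A candidate solution $\hat{\mathbf{X}}$ returned by any solver can then be passed to `normalized_error` and compared against the chosen success threshold.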

We first test our bound in part (b) of Theorem 3.3, $\Omega(rn)$, on the number of measurements for recovery
in the case of minimizing $\max\{\frac{\mathrm{tr}(\mathbf{X})}{\mathrm{tr}(\mathbf{X}_0)}, \frac{\|\mathbf{X}\|_{1,2}}{\|\mathbf{X}_0\|_{1,2}}\}$ over the set of positive semidefinite matrices. Figure 4 shows
the results, which demonstrate $m$ scaling linearly with $n$ (note that $r = 1$).

Next, we replace the $\ell_{1,2}$ norm with the $\ell_1$ norm and consider a recovery program that emphasizes entry-wise
sparsity rather than block sparsity. Figure 5 demonstrates the lower bound $\Omega(\min\{k^2, n\})$ in part (c) of Theorem
3.3, where we attempt to recover a rank-1 positive semidefinite matrix $\mathbf{X}_0$ by minimizing $\max\{\frac{\mathrm{tr}(\mathbf{X})}{\mathrm{tr}(\mathbf{X}_0)}, \frac{\|\mathbf{X}\|_1}{\|\mathbf{X}_0\|_1}\}$

Figure 4: Performance of the recovery program minimizing $\max\{\frac{\mathrm{tr}(\mathbf{X})}{\mathrm{tr}(\mathbf{X}_0)}, \frac{\|\mathbf{X}\|_{1,2}}{\|\mathbf{X}_0\|_{1,2}}\}$ with a PSD constraint. The dark
region corresponds to the experimental region of failure due to insufficient measurements. As predicted by Theorem
3.3, the number of required measurements increases linearly with $rn$.



subject to the measurements and a PSD constraint. As pointed out in Section 5, the lower bound given in
Theorem 3.2 has a target-dependent term $\theta^2$, so to make the different test instances comparable, we plot $\frac{m}{\theta^2}$
versus $n$. The green curve in the figure shows the empirical 95% failure boundary, depicting the region of
failure with high probability that our results have predicted. It starts off growing linearly with $n$, when the
term $rn$ dominates the term $k^2$, and then saturates as $n$ grows and the $k^2$ term (which is a constant in our
experiments) becomes dominant.




Figure 5: Performance of the recovery program minimizing $\max\{\frac{\mathrm{tr}(\mathbf{X})}{\mathrm{tr}(\mathbf{X}_0)}, \frac{\|\mathbf{X}\|_1}{\|\mathbf{X}_0\|_1}\}$ with a PSD constraint. $r = 1$, $k = 8$
and $n$ is allowed to vary. The plot shows $\frac{m}{\theta^2}$ versus $n$ to better illustrate the lower bound $\Omega(\min\{k^2, nr\})$ predicted
by Theorem 3.3.



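The penalty function $\max\{\frac{\mathrm{tr}(\mathbf{X})}{\mathrm{tr}(\mathbf{X}_0)}, \frac{\|\mathbf{X}\|_1}{\|\mathbf{X}_0\|_1}\}$ depends on the norm of $\mathbf{X}_0$. In practice the norm of the solution
is not known beforehand, so a weighted sum of norms is used instead. In Figure 6 we examine the performance
of the weighted sum of norms penalty in recovery of a rank-1 PSD matrix, for different weights. We pick
$\lambda = 0.20$ and $\lambda = 0.35$ for a randomly generated matrix $\mathbf{X}_0$, and it can be seen that we get a reasonable
result which is comparable to the performance of $\max\{\frac{\mathrm{tr}(\mathbf{X})}{\mathrm{tr}(\mathbf{X}_0)}, \frac{\|\mathbf{X}\|_1}{\|\mathbf{X}_0\|_1}\}$.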

In addition, we consider the amount of error in the recovery when the program fails. Figure 7 shows two
curves below which we get a 90% failure rate, where for the green curve the normalized error threshold for
declaring failure is $10^{-4}$ and for the red curve it is a larger value of 0.05. We minimize $\max\{\frac{\mathrm{tr}(\mathbf{X})}{\mathrm{tr}(\mathbf{X}_0)}, \frac{\|\mathbf{X}\|_1}{\|\mathbf{X}_0\|_1}\}$
as the objective. We observe that when the recovery program has an error, it is very likely that this error
is large, as the curves for $10^{-4}$ and 0.05 almost overlap. Thus, when the program fails, it fails badly. This
observation agrees with intuition from similar problems in compressed sensing, where a sharp phase transition
is observed.




Figure 7: 90% frequency of failure where the threshold of recovery is $10^{-4}$ for the green curve and 0.05 for the red curve.
$\max\{\frac{\mathrm{tr}(\mathbf{X})}{\mathrm{tr}(\mathbf{X}_0)}, \frac{\|\mathbf{X}\|_1}{\|\mathbf{X}_0\|_1}\}$ is minimized subject to the PSD constraint and the measurements.

As a final comment, observe that in Figures 5, 6 and 7 the required number of measurements slowly
increases even when $n$ is large and $k^2 = 64$ is the dominant constant term. While this is consistent with our
lower bound of $\Omega(\min\{k^2, n\})$, the slow increase for constant $k$ can be explained by the fact that, as $n$ gets larger,
sparsity becomes the dominant structure and $\ell_1$ minimization by itself requires $O(k^2 \log\frac{n}{k})$ measurements
rather than $O(k^2)$. Hence for large $n$, the number of measurements can be expected to grow logarithmically
in $n$.



7 Discussion 

We have considered the problem of recovery of a simultaneously structured object from limited measurements.
It is common in practice to combine known norm penalties corresponding to the individual structures (also
known as regularizers in statistics and machine learning applications), and minimize this combined objective
in order to recover the object of interest. The common use of this approach motivated us to analyze its
performance, in terms of the smallest number of generic measurements needed for correct recovery. We
showed that, under a certain assumption on the norms involved, the combined penalty requires more generic
measurements than one would expect based on the degrees of freedom of the desired object. Our lower
bounds on the required number of measurements imply that the combined norm penalty cannot perform
significantly better than the best individual norm.

These results raise several interesting questions, and lead to directions for future work. We briefly outline
some of these directions, as well as connections to some related problems.



Other types of measurements, and sparse phase retrieval. Our current analysis is limited to random
Gaussian measurements, and in some applications other kinds of measurements need to be considered. In
the sparse phase retrieval problem, the measurements are given by $\langle \mathbf{a}_i \mathbf{a}_i^T, \mathbf{X} \rangle = b_i$, where $\mathbf{a}_i$ is a random
Gaussian vector and we wish to recover the underlying matrix $\mathbf{X}_0 = \mathbf{x}_0 \mathbf{x}_0^T$, where $\mathbf{x}_0$ is a $k$-sparse vector.
In this case, a different probabilistic argument might be necessary due to the products of random Gaussian
variables that appear. In [31] the authors derive recovery guarantees for these measurements, but they do not
consider signal sparsity, and their analysis does not immediately extend to the sparse case. The very recent
paper [25] considers the sparse phase retrieval problem, and gives two results, assuming $\mathbf{x}_0$ has unit $\ell_2$ norm
and m <^ n: first, if to > O (||xo||^fc log n) then minimizing ||X||i + Atr (X) for suitable value of A over the 
set of PSD matrices will exactly recover Xq with high probability. Second, they give a necessary condition 
on the number of measurements, TOq = min — 1)^, '"^'^(^^oili^^^V^.o) ^ ^ under which the recovery program 

fails to recover Xq with high probability whenever m < TOq . 

We should emphasize that the lower bound provided in [25] is directly comparable to our results. First,
observe that since $\mathbf{X}_0$ is rank 1, and we are using the $\ell_1$ and nuclear norms subject to a PSD constraint, the
problem falls into Model (c) of Definition 3.2. Our result regarding this model is given in Theorem 3.3 which

I II ^ 

suggests that one needs at least jtiq = c ^ min{fc^, n} measurements. Assuming m n, our lower bound 
takes a simpler form which is TOq = c||xo||^fc. 

Now, comparing our lower bound to the bound given by [25], we have 

< max(||xo||r fc/2,0)^ < < < 500c^ . 

500 log 500 log n 500 log n log (n) 

where we use the fact that ||xo||i < \/k for unit length and k sparse vector Xq. These simple operations 
suggest our lower bound is larger by a factor of > log^ n where we omit the constant term 500c 

for the sake of clarity. Overall, while the present paper and [25] analyze the same problem with different 
measurement operators, the results are consistent as logarithmic terms have relatively minor importance. 

It is also of interest to study recovery properties of simultaneously structured models using other classes
of measurements, for example cases where the measurement vectors $\{\mathbf{a}_i\}_{i=1}^{m}$ are binary or are sampled from
the rows of a Discrete Fourier Transform matrix.



Quantifying recovery failure via error bounds. We observe from the recovery error plots shown in
Figure 7 that whenever our recovery program fails, it fails with a significant recovery error. The figure shows
two curves under which recovery fails with high probability, where failure is defined by the normalized error
$\|\mathbf{X} - \mathbf{X}_0\|_F / \|\mathbf{X}_0\|_F$ being above $10^{-4}$ and 0.05. The two curves almost coincide. This observation leads to
the question of whether we can characterize how large the error is with high probability over the random
measurements. A lower bound on the recovery error as a function of the problem parameters would
be very insightful.



Defining new atoms for simultaneously structured models. Our results show that combinations of
individual norms do not exhibit a strong recovery performance. On the other hand, the seminal paper [8]
proposes a remarkably general construction for an appropriate penalty given a set of atoms. Can we revisit
a simultaneously structured recovery problem, and define new atoms that capture all structures at the same
time? And can we obtain a new norm penalty induced by the convex hull of the atoms? Abstractly, the
answer is yes, but such convex hulls may be hard to characterize, and the corresponding penalty may not be
efficiently computable. It is interesting to find special cases where this construction can be carried out and
results in a tractable problem.

Algorithms for minimizing combination of norms. Despite the limitation in their theoretical per- 
formance, in practice one may still need to solve convex relaxations that combine the different norms, i.e., 
problem (2.7). Consider the special case of sparse and low-rank matrix recovery. All corresponding optimiza- 
tion problems mentioned in Theorem 3.3 can be expressed as a semidefinite program and solved by standard 
solvers; for example, for the numerical experiments in Section 6 we used the interior-point solver SeDuMi 
[43] via the modeling environment CVX [42]. However, interior point methods do not scale for problems 
with tens of thousands of matrix entries, which are common in machine learning applications. One future 
research direction is to explore first-order methods, which have been successful in solving problems with a 
single structure (for example $\ell_1$ or nuclear norm regularization alone). In particular, the Alternating Direction
Method of Multipliers (ADMM) appears to be a promising candidate.
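
As a rough illustration of why first-order methods are attractive here, the sketch below implements the two proximal operators that a splitting scheme such as ADMM would typically alternate between for an $\ell_1$ plus nuclear norm objective. It is not taken from the paper: the function names `prox_l1` and `prox_nuclear`, the threshold value, and the test matrix are assumptions chosen for the example, and no full ADMM solver is given.

```python
import numpy as np

def prox_l1(X, tau):
    """Entrywise soft-thresholding: proximal operator of tau * ||X||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def prox_nuclear(X, tau):
    """Singular value thresholding: proximal operator of tau * ||X||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Shrink the singular values and rebuild the matrix.
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Hypothetical input, only to exercise the two operators.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5))
tau = 0.5
X_sparse = prox_l1(X, tau)        # promotes entrywise sparsity
X_lowrank = prox_nuclear(X, tau)  # promotes low rank
```

Each operator costs at most one SVD of the iterate, which is the reason such schemes can scale beyond the sizes handled by interior-point solvers.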

Characterizing the tightness of the lower bounds. The results provided in this paper are negative
in nature, as we characterize the lower bounds on the required number of measurements for mixed convex
recovery problems. However, it would be interesting to see how much we can gain by making use of multiple
norms and how tight these lower bounds are. In [44], the authors investigate a specific simultaneous model where
the signal $x \in \mathbb{R}^n$ is sparse in both time and frequency domains, i.e., $x$ and $Dx$ are $k_1$- and $k_2$-sparse respectively, where
$D$ is the Discrete Fourier Transform matrix. For recovery, the authors consider minimizing $\|x\|_1 + \lambda\|Dx\|_1$
subject to measurements. Intuitively, the results of this paper would suggest the necessity of $\Omega(\min\{k_1, k_2\})$
measurements for successful recovery. On the other hand, the best of the individual functions ($\ell_1$ norms) will
require $\Omega(\min\{k_1 \log\frac{n}{k_1}, k_2 \log\frac{n}{k_2}\})$ measurements. In [44], it is shown that the mixed approach will require
as little as $\max\{k_1, k_2\} \log\log n$ under mild assumptions.

This shows that the mixed approach can result in a logarithmic improvement over the individual functions
when $k_1 \approx k_2$, and the lower bound given by this paper can be achieved up to a $\log\log n$ factor.

Connection to Sparse PCA. The sparse PCA problem (see, e.g. [45, 46, 47]) seeks sparse principal
components given a (possibly noisy) data matrix. Several formulations for this problem exist, and many 
algorithms have been proposed. In particular, a popular algorithm is the SDP relaxation proposed in [47], 
which is based on the following formulation. 

For the first principal component to be sparse, we seek an $x \in \mathbb{R}^n$ that maximizes $x^T A x$ for a given
data matrix $A$, and minimizes $\|x\|_0$. Similar to the sparse phase retrieval problem, this problem can be
reformulated in terms of a rank-1, PSD matrix $X = xx^T$ which is also row- and column-sparse. Thus we
seek a simultaneously low-rank and sparse $X$. This problem is different from the recovery problem studied
in this paper, since we do not have $m$ random measurements of $X$. Yet, it will be interesting to connect this
paper's results to the sparse PCA problem to potentially provide new insights for sparse PCA.
APPENDIX



A Properties of Cones 

In this appendix, we state some results regarding cones which are used in the proof of general recovery. 
Recall the definitions of polar and dual cones from Section 2. 

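Theorem A.1 (Moreau's decomposition theorem, [48]) Let $\mathcal{C}$ be a closed and convex cone in $\mathbb{R}^n$.
Then, for any $x \in \mathbb{R}^n$, we have

• $x = \mathcal{P}_{\mathcal{C}}(x) + \mathcal{P}_{\mathcal{C}^\circ}(x)$.

• $\langle \mathcal{P}_{\mathcal{C}}(x), \mathcal{P}_{\mathcal{C}^\circ}(x) \rangle = 0$.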
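
Moreau's decomposition can be checked numerically on a simple cone. The sketch below uses the nonnegative orthant, whose polar cone is the nonpositive orthant; this choice of cone and the helper names `project_orthant` and `project_polar_orthant` are assumptions made for the example, not objects used in the paper.

```python
import numpy as np

def project_orthant(x):
    """Projection onto the nonnegative orthant C = {x : x >= 0}."""
    return np.maximum(x, 0.0)

def project_polar_orthant(x):
    """Projection onto the polar cone of C, i.e. {x : x <= 0}."""
    return np.minimum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

p = project_orthant(x)
q = project_polar_orthant(x)

# First bullet: x is the sum of its two projections.
decomposition_error = np.linalg.norm(x - (p + q))
# Second bullet: the two projections are orthogonal.
inner_product = float(p @ q)
```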

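Lemma A.1 (Projection is nonexpansive) Let $\mathcal{C} \subseteq \mathbb{R}^n$ be a closed and convex set and $a, b \in \mathbb{R}^n$ be
vectors. Then,

$$\|\mathcal{P}_{\mathcal{C}}(a) - \mathcal{P}_{\mathcal{C}}(b)\|_2 \le \|a - b\|_2.$$ (A.1)

Corollary A.1 Let $\mathcal{C}$ be a closed convex cone and $a, b$ be vectors satisfying $\mathcal{P}_{\mathcal{C}}(a - b) = 0$. Then

$$\|b\|_2 \ge \|\mathcal{P}_{\mathcal{C}}(a)\|_2.$$ (A.2)

Proof. Using Lemma A.1, we have $\|\mathcal{P}_{\mathcal{C}}(a)\|_2 = \|\mathcal{P}_{\mathcal{C}}(a) - \mathcal{P}_{\mathcal{C}}(a - b)\|_2 \le \|b\|_2$. ■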
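
A small numerical check of Lemma A.1 and Corollary A.1, again on the nonnegative orthant (an assumed example cone, not one singled out by the paper). To satisfy the hypothesis of the corollary, `b` is built as `a - w` with `w` in the polar cone, so that the projection of `a - b` onto the cone vanishes.

```python
import numpy as np

def project_orthant(x):
    """Projection onto the nonnegative orthant, a closed convex cone."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
n = 10
a = rng.standard_normal(n)
b_any = rng.standard_normal(n)

# Lemma A.1: projection does not increase distances.
lhs = np.linalg.norm(project_orthant(a) - project_orthant(b_any))
rhs = np.linalg.norm(a - b_any)

# Corollary A.1: choose w with nonpositive entries, so P_C(a - b) = P_C(w) = 0.
w = -np.abs(rng.standard_normal(n))
b = a - w
norm_b = np.linalg.norm(b)
norm_proj_a = np.linalg.norm(project_orthant(a))
```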
The unit sphere in $\mathbb{R}^n$ will be denoted by $\mathcal{S}^{n-1}$ for the following theorems.

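Theorem A.2 (Escape through a mesh, [49]) For a given set $\mathcal{D} \subseteq \mathcal{S}^{n-1}$, define the Gaussian width as

$$\omega(\mathcal{D}) = \mathbb{E}\Big[\sup_{x \in \mathcal{D}} \langle x, g \rangle\Big],$$

in which $g \in \mathbb{R}^n$ has i.i.d. standard Gaussian entries. Given $m$, let $d = \sqrt{n-m} - \frac{1}{4\sqrt{n-m}}$. Provided that
$\omega(\mathcal{D}) \le d$, a random $m$-dimensional subspace which is uniformly drawn w.r.t. Haar measure will have no
intersection with $\mathcal{D}$ with probability at least

$$1 - 3.5\exp\big(-(d - \omega(\mathcal{D}))^2\big).$$ (A.3)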

Theorem A.3 Consider a random Gaussian map $\mathcal{G} : \mathbb{R}^n \to \mathbb{R}^m$ with i.i.d. entries and the corresponding
adjoint operator $\mathcal{G}^*$. Let $\mathcal{C}$ be a closed and convex cone and recalling Definition 3.1, let

C(C) := 1 - r,(C°), 7(C) - ^'^^^^ 



l-r;(C°)- 

Then, if m < ^^^^n, with probability at least 1 — 6 exp(— (^^)^n), for all z € R" we have 

$$\|\mathcal{G}^*(z)\|_2 \le \gamma(\mathcal{C})\,\|\mathcal{P}_{\mathcal{C}}(\mathcal{G}^*(z))\|_2.$$ (A.4)

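Proof. For notational simplicity, let $\zeta = \zeta(\mathcal{C})$ and $\gamma = \gamma(\mathcal{C})$. Consider the set

$$\mathcal{D} = \{x \in \mathcal{S}^{n-1} : \|x\|_2 \ge \gamma \|\mathcal{P}_{\mathcal{C}}(x)\|_2\},$$

and we are going to show that with high probability, the range of $\mathcal{G}^*$ misses $\mathcal{D}$. Using Theorem A.1, for any
$x \in \mathcal{D}$, we may write

$$\begin{aligned}
\langle x, g \rangle &= \langle \mathcal{P}_{\mathcal{C}}(x) + \mathcal{P}_{\mathcal{C}^\circ}(x),\ \mathcal{P}_{\mathcal{C}}(g) + \mathcal{P}_{\mathcal{C}^\circ}(g) \rangle \\
&\le \langle \mathcal{P}_{\mathcal{C}}(x), \mathcal{P}_{\mathcal{C}}(g) \rangle + \langle \mathcal{P}_{\mathcal{C}^\circ}(x), \mathcal{P}_{\mathcal{C}^\circ}(g) \rangle \qquad \text{(A.5)} \\
&\le \|\mathcal{P}_{\mathcal{C}}(x)\|_2 \|\mathcal{P}_{\mathcal{C}}(g)\|_2 + \|\mathcal{P}_{\mathcal{C}^\circ}(x)\|_2 \|\mathcal{P}_{\mathcal{C}^\circ}(g)\|_2 \\
&\le \gamma^{-1} \|\mathcal{P}_{\mathcal{C}}(g)\|_2 + \|\mathcal{P}_{\mathcal{C}^\circ}(g)\|_2,
\end{aligned}$$

where in (A.5) we used the fact that elements of $\mathcal{C}$ and $\mathcal{C}^\circ$ have nonpositive inner products, and $\|\mathcal{P}_{\mathcal{C}^\circ}(x)\|_2 \le
\|x\|_2$ is by Lemma A.1. Hence, from the definition of Gaussian width,

$$\omega(\mathcal{D}) = \mathbb{E}\Big[\sup_{x \in \mathcal{D}} \langle x, g \rangle\Big] \le \gamma^{-1}\,\mathbb{E}[\|\mathcal{P}_{\mathcal{C}}(g)\|_2] + \mathbb{E}[\|\mathcal{P}_{\mathcal{C}^\circ}(g)\|_2]$$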

2-C r- 



- V^(7-^/(C) + 77(r)) 

Now, whenever 



we have 



m< ^n< (1- (^-^)2)n = m', (A.6) 
16 4 

(V^T^ - ^{V) 1. f > iV^T^ i^{V)f -\> {^fn \. (A.7) 

Now, using Theorem A.2, the range space of $\mathcal{G}^*$ will miss the undesired set $\mathcal{D}$ with probability at least
$1 - 3.5\exp\big(-(\tfrac{\zeta}{4})^2 n + \tfrac{1}{2}\big) \ge 1 - 6\exp\big(-(\tfrac{\zeta}{4})^2 n\big)$. ■

Corollary A.2 Consider the cones $\mathbb{S}^n$ and $\mathbb{S}^n_+$ in the space $\mathbb{R}^{n \times n}$. For all positive integers $n$, we have

• C(S") > i and 7(§") < 7. 

• C(§+) > I and-f{El) < 11. 

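Proof. Let $G$ be an $n \times n$ matrix with i.i.d. standard normal entries. The set of symmetric matrices $\mathbb{S}^n$ is an
$\frac{n(n+1)}{2}$ dimensional subspace of $\mathbb{R}^{n \times n}$. Hence, $\mathbb{E}\|\mathcal{P}_{\mathbb{S}^n}(G)\|_F^2 = \frac{n(n+1)}{2}$ and $\mathbb{E}\|\mathcal{P}_{(\mathbb{S}^n)^\circ}(G)\|_F^2 = \frac{n(n-1)}{2}$.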



, , _ E[l|P,s.HG)||.l ^ ^ _ Ve[||P,..,.(G)II|] ^ ^ ^ ^ 1 ^ 

n ?i V 2?! ^/2 

Similarly, 7(8") = ^^Kl < 7j,(§") < 7. 

To prove the second statement, observe that the projection of a matrix $A \in \mathbb{R}^{n \times n}$ onto $\mathbb{S}^n_+$ is obtained by first
projecting $A$ onto $\mathbb{S}^n$ and then taking the matrix induced by the positive eigenvalues of $\mathcal{P}_{\mathbb{S}^n}(A)$. Since $G$
and $-G$ are identically distributed and $\mathbb{S}^n_+$ is a self dual cone, $\mathcal{P}_{\mathbb{S}^n_+}(G)$ is identically distributed as $-\mathcal{P}_{\mathbb{S}^n_-}(G)$,
where $\mathbb{S}^n_- = (\mathbb{S}^n_+)^\circ$ stands for negative semidefinite matrices. Hence,



Consequently, 



4;r>^-T 



C(§';) > 1 - ^ : > 1 - ^/ > 1 - V- (A.io) 



B Properties of Norms 

Lemma B.1 Given a norm $\|\cdot\|$, denote by $L$ the Lipschitz constant of this norm. For any $x$, we have

$$L = \sup_{\|y\|_2 \le 1} \|y\| = \sup_{\|z\|^* \le 1} \|z\|_2 \ge \sup_{g \in \partial\|x\|} \|g\|_2.$$ (B.1)
Proof. Let $M = \sup_{\|y\|_2 \le 1} \|y\|$ and $C = \sup_{\|z\|^* \le 1} \|z\|_2$. We have the following.

- $M = L$ because on one hand for all $x$ we have $\|x\| = \|x\| - \|0\| \le L\|x\|_2$, which implies $M \le L$, and on
the other hand $\big|\|x\| - \|y\|\big| \le \|x - y\| \le M\|x - y\|_2$, which holds for all $x$ and $y$ and implies $L \le M$.

- $M \le C$ because for any $x$ we have

$$\|x\| = \sup_{\|z\|^* \le 1} \langle x, z \rangle \le \sup_{\|z\|^* \le 1} \|z\|_2 \|x\|_2.$$

- In addition, let $z_0 = \arg\sup_{\|z\|^* \le 1} \|z\|_2$. Then, we have

$$C\|z_0\|_2 = \|z_0\|_2^2 = \langle z_0, z_0 \rangle \le \sup_{\|z\|^* \le 1} \langle z, z_0 \rangle = \|z_0\|,$$

which gives $C \le \frac{\|z_0\|}{\|z_0\|_2} \le L$.

Overall, we obtain $L = M = C$. Finally, since $\partial\|x\| \subseteq \{z : \|z\|^* \le 1\}$, we get the last inequality in (B.1). ■
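
To make Lemma B.1 concrete, the sketch below instantiates it for the $\ell_1$ norm in $\mathbb{R}^n$, where the dual norm is $\ell_\infty$ and all three quantities in (B.1) can be evaluated directly. The choice of norm, the dimension, and the variable names are assumptions made for the example.

```python
import numpy as np

n = 16
rng = np.random.default_rng(0)

# sup of ||y||_1 over the Euclidean unit ball is attained at y = (1,...,1)/sqrt(n).
y_star = np.ones(n) / np.sqrt(n)
M = np.linalg.norm(y_star, 1)

# sup of ||z||_2 over the l_inf unit ball (the dual ball) is attained at a sign vector.
z_star = np.ones(n)
C = np.linalg.norm(z_star, 2)

# A subgradient of the l1 norm at a sparse x: sign(x), with zeros off the support.
x = np.zeros(n)
x[:4] = rng.standard_normal(4)
g = np.sign(x)
g_norm = np.linalg.norm(g, 2)

# Empirical check that no unit-norm vector has l1 norm above M.
ys = rng.standard_normal((1000, n))
ys /= np.linalg.norm(ys, axis=1, keepdims=True)
max_ratio = np.abs(ys).sum(axis=1).max()
```

Here both suprema equal $\sqrt{n}$, while the subgradient norm depends only on the support size of `x`, which is why the last relation in (B.1) is an inequality.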

B.1 Norms in Sparse and Low-rank Model
B.1.1 Relevant notation for the proofs

Let $[k]$ denote the set $\{1, 2, \dots, k\}$. Let $S_c, S_r$ denote the indexes of the nonzero columns and rows of $X_0$
so that the nonzero entries of $X_0$ lie on the $S_r \times S_c$ submatrix. $\mathcal{S}_c, \mathcal{S}_r$ denote the $k_1, k_2$ dimensional subspaces of
vectors whose nonzero entries lie on $S_c$ and $S_r$ respectively.

Let $X_0$ have singular value decomposition $U \Sigma V^T$ such that $\Sigma \in \mathbb{R}^{r \times r}$ and the columns of $U, V$ lie on $\mathcal{S}_c, \mathcal{S}_r$
respectively.

B.1.2 Proof of Lemma 5.1 

Proof. Observe that $T_c = \mathbb{R}^n \times \mathcal{S}_c$ and $T_r = \mathcal{S}_r \times \mathbb{R}^n$, hence $T_c \cap T_r$ is the set of matrices that lie on $S_r \times S_c$.
Hence, $E_\star = UV^T \in T_c \cap T_r$. Similarly, $E_c$ and $E_r$ are the matrices obtained by scaling columns and rows
of $X_0$ to have unit size. As a result, they lie on $S_r \times S_c$ and $T_c \cap T_r$. $E_\star \in T_\star$ by definition.

Next, we may write $E_c = X_0 D_c$ where $D_c$ is the scaling nonnegative diagonal matrix. Consequently, $E_c$
lies on the range space of $X_0$ and belongs to $T_\star$. This follows from the definition of $T_\star$ in Section 2 and the fact
that $(I - UU^T)E_c = 0$.

In the exact same way, $E_r = D_r X_0$ for some nonnegative diagonal $D_r$ and lies on the range space of $X_0^T$,
and hence lies on $T_\star$. Consequently, $E_\star, E_c, E_r$ lie on $T_c \cap T_r \cap T_\star$.
Now, consider

$$\langle E_c, E_\star \rangle = \langle X_0 D_c, UV^T \rangle = \mathrm{tr}(V U^T U \Sigma V^T D_c) = \mathrm{tr}(V \Sigma V^T D_c) \ge 0,$$

since both $V \Sigma V^T$ and $D_c$ are positive semidefinite matrices. In the exact same way, we have $\langle E_r, E_\star \rangle \ge 0$.
Finally,

$$\langle E_c, E_r \rangle = \langle X_0 D_c, D_r X_0 \rangle = \mathrm{tr}(D_c X_0^T D_r X_0) \ge 0,$$ (B.2)

since both $D_c$ and $X_0^T D_r X_0$ are PSD matrices. Overall, the pairwise inner products of $E_r, E_c, E_\star$ are nonnegative.

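Now, we proceed with the remaining statements where $X_0$ is rank one and we deal with $\ell_1$ and nuclear
norms. Since $X_0$ is rank 1, we immediately have $E_\star = uv^T \in T_1$ since the nonzero locations of $X_0$ and $E_\star$ are
the same. Observe that $E_1$ is not necessarily inside $T_\star$, but both $uu^T E_1$ and $E_1 vv^T$ are inside $T_\star$. Consequently,
we have

$$\|\mathcal{P}_{T_\star}(E_1)\|_F^2 = \|uu^T E_1\|_F^2 + \|E_1 vv^T\|_F^2 - \|uu^T E_1 vv^T\|_F^2 \ge \max\{\|uu^T E_1\|_F^2, \|E_1 vv^T\|_F^2\},$$ (B.3)

where

$$\|uu^T E_1\|_F = \|uu^T \mathrm{sgn}(u)\,\mathrm{sgn}(v)^T\|_F = \|u\|_1 \sqrt{k_2} = \frac{\|u\|_1}{\sqrt{k_1}} \|E_1\|_F,$$ (B.4)

and similarly $\|E_1 vv^T\|_F = \frac{\|v\|_1}{\sqrt{k_2}} \|E_1\|_F$. Hence,

$$\|\mathcal{P}_{T_\star}(E_1)\|_F \ge \max\Big\{\frac{\|u\|_1}{\sqrt{k_1}}, \frac{\|v\|_1}{\sqrt{k_2}}\Big\} \|E_1\|_F.$$ (B.5)

Finally, since $E_\star \in T_\star \cap T_1$, we have

$$\langle \mathcal{P}_{T_\star \cap T_1}(E_1), \mathcal{P}_{T_\star \cap T_1}(E_\star) \rangle = \langle E_1, \mathcal{P}_{T_\star \cap T_1}(E_\star) \rangle = \langle E_1, E_\star \rangle$$ (B.6)

$$= \langle \mathrm{sgn}(u)\,\mathrm{sgn}(v)^T, uv^T \rangle = \|u\|_1 \|v\|_1 \ge 0.$$ (B.7)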
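
The identities (B.3) through (B.7) are easy to sanity-check numerically for a random rank-one sparse matrix. In the sketch below, the ambient dimensions `d1`, `d2`, the sparsity levels, and all variable names are assumptions for illustration; the projection onto $T_\star$ is computed with the standard formula $uu^T E + E vv^T - uu^T E vv^T$ for the rank-one case.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k1, k2 = 12, 10, 4, 3  # hypothetical sizes

# Unit-norm sparse factors u (k1-sparse) and v (k2-sparse).
u = np.zeros(d1); u[:k1] = rng.standard_normal(k1); u /= np.linalg.norm(u)
v = np.zeros(d2); v[:k2] = rng.standard_normal(k2); v /= np.linalg.norm(v)

E1 = np.outer(np.sign(u), np.sign(v))   # sign pattern of X0 = u v^T
E_star = np.outer(u, v)
Pu, Pv = np.outer(u, u), np.outer(v, v)

fro = np.linalg.norm                     # Frobenius norm for 2-D inputs
proj = Pu @ E1 + E1 @ Pv - Pu @ E1 @ Pv  # projection of E1 onto T_star

lhs_b3 = fro(proj) ** 2
rhs_b3 = fro(Pu @ E1) ** 2 + fro(E1 @ Pv) ** 2 - fro(Pu @ E1 @ Pv) ** 2
b4_left = fro(Pu @ E1)
b4_right = np.linalg.norm(u, 1) / np.sqrt(k1) * fro(E1)
b5_bound = max(np.linalg.norm(u, 1) / np.sqrt(k1),
               np.linalg.norm(v, 1) / np.sqrt(k2)) * fro(E1)
b7_inner = float(np.sum(E1 * E_star))
```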



Remark B.1 Observe that since $u, v$ are unit length and $k_1, k_2$ sparse, we always have

$$\max\Big\{\frac{\|u\|_1}{\sqrt{k_1}}, \frac{\|v\|_1}{\sqrt{k_2}}\Big\} \le 1,$$

where equality is achieved when $\sqrt{k_1}\,u$ or $\sqrt{k_2}\,v$ are sign vectors; i.e. nonzero entries are $-1$ or $1$.

In general, we may let $X_0 = ab^T$ where $a, b$ are $k_1, k_2$ sparse and their nonzero entries are independent and
identically distributed as random variables $x_1, x_2$ respectively. WLOG, let us consider $a$. Assuming $x_1$ is
zero-mean with finite fourth moment, for sufficiently large $k_1$, with high probability, we would have

$$\frac{\|u\|_1}{\sqrt{k_1}} = \frac{\|a\|_1}{\sqrt{k_1}\,\|a\|_2} \approx \frac{\mathbb{E}|x_1|}{\sqrt{\mathbb{E}[x_1^2]}}.$$ (B.8)

This follows from the fact that $\mathbb{E}[\|a\|_1] = k_1 \mathbb{E}|x_1|$ and $\mathbb{E}[\|a\|_2^2] = k_1 \mathbb{E}[x_1^2]$, and hence $\|a\|_1$ and $\|a\|_2$ will
concentrate around $k_1 \mathbb{E}|x_1|$ and $\sqrt{k_1 \mathbb{E}[x_1^2]}$ as $k_1$ grows. For example, when $x_1 \sim \mathcal{N}(0, 1)$, we have $\mathbb{E}[|x_1|] = \sqrt{2/\pi}$
and $\mathbb{E}[x_1^2] = 1$, hence $\frac{\|u\|_1}{\sqrt{k_1}} \approx \sqrt{2/\pi}$. Overall, it is reasonable to consider $\frac{\|u\|_1}{\sqrt{k_1}}$ as approximately constant,
as the right hand side of (B.8) is only a function of the distribution of $x_1$. The identical argument will apply for
$v$.
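
The concentration claim in Remark B.1 can be illustrated with a quick simulation for Gaussian entries. The sample size `k1` and the seed below are arbitrary assumptions; the point is only that the ratio in (B.8) lands near $\sqrt{2/\pi}$ when $k_1$ is large.

```python
import numpy as np

rng = np.random.default_rng(0)
k1 = 20000  # hypothetical (large) number of nonzero entries

# Nonzero entries of a, drawn i.i.d. from N(0, 1).
a = rng.standard_normal(k1)

# Left-hand side of (B.8): ||u||_1 / sqrt(k1) with u = a / ||a||_2.
ratio = np.linalg.norm(a, 1) / (np.sqrt(k1) * np.linalg.norm(a, 2))

# Right-hand side of (B.8) for the standard normal: E|x| / sqrt(E[x^2]).
target = np.sqrt(2.0 / np.pi)
gap = abs(ratio - target)
```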

B.1.3 Results on positive semidefinite constraint 

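Lemma B.2 Assume $X, Y \in \mathbb{S}^n_+$ have eigenvalue decompositions $X = \sum_{i=1}^{\mathrm{rank}(X)} \sigma_i u_i u_i^T$ and $Y = \sum_{i=1}^{\mathrm{rank}(Y)} c_i v_i v_i^T$.
Further, assume $\langle Y, X \rangle = 0$. Then, $U^T Y = 0$ where $U = [u_1\ u_2\ \dots\ u_{\mathrm{rank}(X)}]$.

Proof. Observe that

$$\langle Y, X \rangle = \sum_{i=1}^{\mathrm{rank}(X)} \sum_{j=1}^{\mathrm{rank}(Y)} \sigma_i c_j |u_i^T v_j|^2.$$ (B.9)

Since $\sigma_i, c_j > 0$, the right hand side is $0$ if and only if $u_i^T v_j = 0$ for all $i, j$. Hence, the result follows. ■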
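
The identity (B.9) and the conclusion of Lemma B.2 can be checked on small random PSD matrices. The sketch below is illustrative only: the helper `b9_double_sum`, the sizes, and the eigenvalue ranges are assumptions. It compares the trace inner product with the double sum in a generic case, and then builds a pair with orthogonal ranges to see both sides vanish together with $U^T Y$.

```python
import numpy as np

def b9_double_sum(U, sigma, V, c):
    """Right-hand side of (B.9): sum_ij sigma_i * c_j * |u_i^T v_j|^2."""
    overlaps = (U.T @ V) ** 2
    return float(sigma @ overlaps @ c)

rng = np.random.default_rng(0)
n, rx, ry = 6, 2, 3
sigma = rng.uniform(0.5, 2.0, rx)  # positive eigenvalues of X
c = rng.uniform(0.5, 2.0, ry)      # positive eigenvalues of Y

# Generic case: eigenvector sets with overlapping ranges.
U = np.linalg.qr(rng.standard_normal((n, rx)))[0]
V = np.linalg.qr(rng.standard_normal((n, ry)))[0]
X = (U * sigma) @ U.T
Y = (V * c) @ V.T
inner_generic = float(np.trace(Y.T @ X))
sum_generic = b9_double_sum(U, sigma, V, c)

# Orthogonal case: columns taken from one orthonormal basis.
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
U_orth, V_orth = Q[:, :rx], Q[:, rx:rx + ry]
Y_orth = (V_orth * c) @ V_orth.T
X_orth = (U_orth * sigma) @ U_orth.T
inner_orth = float(np.trace(Y_orth.T @ X_orth))
```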

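Lemma B.3 Assume $X_0 \in \mathbb{S}^n_+$ so that in Section B.1.1, $S_c = S_r$, $T_c = T_r$, $k_1 = k_2 = k$ and $U = V$. Let
$U_1 \in \mathbb{R}^{n \times (k-r)}$ and $U_2 \in \mathbb{R}^{n \times (n-k)}$ be such that $[U\ U_1]$ and $[U\ U_1\ U_2]$ are orthonormal bases over $\mathcal{S}_c$ and
$\mathbb{R}^n$ respectively. Also call $S_\star = T_\star \cap \mathbb{S}^n$ and let

$$\mathcal{Y} = \{Y \mid Y \in (\mathbb{S}^n_+)^*,\ \langle Y, X_0 \rangle = 0\}.$$ (B.10)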

Then, the following statements hold. 






• $S_\star = \mathrm{span}(\mathcal{Y})^\perp$. Hence, recalling (2.6), $\mathcal{R} = T_\cap \cap S_\star$.

• Based on Definition 3.2, we have TZ = Tc Cl Tr (1 T^, H E"' for "PSD, arbitrary rank" model. Then, 

* ' |Ec||f IE^IIf — ^2 ' 

• Let Xq be rank 1, i.e. Xq = xgx^. Then, 7?. = Ti n Ty, n §" for "PSD, rank 1" model. Then, G TZ 

IIEillF ^ v^!|xo||2- 

Proof. The dual of §![: with respect to M"^" is the set sum of S" and Skew" where Skew" is the set 
of skew-symmetric matrices. Now, assume, Y e y and X G 5*. Then, (Y, X) = (f ,X) where Z = 
Y + e S!;: and (Z, Xo) = 0. Since Xq, Z arc both PSD, applying Lemma B.2, we have U'^Z = hence 
(I - UU'^)Z(I - UU"^) = Z which means Z e T^^. Hence, (Z,X) = (Y, X) = as X e 5* C T*. Hence, 
span(3^) C S^. 

On the other hand, S^ = T^ + (§")-L = T;L _^ gkew". Let Y G and Z = e T;^ ^ gn^ 

Observe that Y — Z G Skew" G y. Let Z has eigenvalue decomposition Z = XiZizf. Under proper 
unitary rotation, H §" is equivalent to a set of matrices that are symmetric and supported over an 
(n — r) X in — r) submatrix. Hence, eigenvectors also satisfy z^z^ G T^nS". Now, observe that z^z^ G (§")* 
and (ziZ^,Xo) = 0. Hence, z^z^ G y which implies Z, Y G span(3^). Overall, S-^ C span(3^). Combined 
with the previous result we have Sj; = span(3^) 

For the second statement, Tn = n Tc n T^ hence 7?. = n Tc n TV n S". Now, recalling Lemma 5.1, 
observe that we already know G Tp where TJ-, = Tc n n T^. Since E^ is also symmetric, E^, G TZ. 
Similarly, E^ G Tn, (Ec,E,.) > and WVni'E^M = 11^4^!!^^ ^ ^tI^' Similar result is true for E^. 

For third statement, Tn = TiHT-^ and TZ = TiHT^nS". Now, observe that when Xq is rank 1, Ti = TcC^Tr 
hence E* G TZ. Secondly, Ei = sgn (xq) sgn (xq)^ G S and T^Tr. (Ei) G S as well since X G Tn =^ X"^ G Tp. 

Then, PTn(Ei) G §n Tn = 7^. Finally, Ei - Vrni^i) G Tr| C 7^-L as 7^ C Tp. This implies PTn(Ei) = 
'p7^(Ei) when combined with 7'Tn(Ei) G TZ. Hence using Lemma 5.1, 

\\Vn{^i)\\F = \\VTn{^,)\\F > (B.U) 

vA:||xo||2 



C Results on non-convex recovery 

The next two lemmas are standard results on Gaussian measurement operators.

Lemma C.1 (Properties of Gaussian mappings) Assume $X$ is an arbitrary matrix with unit Frobenius
norm. An i.i.d. Gaussian measurement operator $\mathcal{G}(\cdot)$ satisfies the following:

• $\mathbb{E}[\|\mathcal{G}(X)\|_2^2] = m$.

• There exists an absolute constant $c > 0$ such that, for all $1 \ge \varepsilon \ge 0$, we have

$$\mathbb{P}\big(\big|\|\mathcal{G}(X)\|_2^2 - m\big| \ge \varepsilon m\big) \le 2\exp(-c\varepsilon^2 m).$$ (C.1)

Proof. Observe that, when $\|X\|_F = 1$, the entries of $\mathcal{G}(X)$ are i.i.d. standard normal. Hence, the first statement
follows directly. For the second statement, we use the fact that the square of a Gaussian random variable is
sub-exponential and view $\|\mathcal{G}(X)\|_2^2$ as a sum of $m$ i.i.d. sub-exponentials. Then, the result follows from Corollary
5.17 of [50]. ■
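
A short Monte Carlo experiment illustrates both statements of Lemma C.1. The sizes `n1`, `n2`, `m`, the number of trials, and the deviation level 0.25 are assumptions chosen for the example. Each row of `A` plays the role of one vectorized Gaussian measurement matrix, so `A @ X.ravel()` stands in for the measurement vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, m, trials = 8, 6, 200, 500  # hypothetical sizes

# Fixed matrix with unit Frobenius norm.
X = rng.standard_normal((n1, n2))
X /= np.linalg.norm(X)

sq_norms = np.empty(trials)
for t in range(trials):
    # Fresh i.i.d. Gaussian operator: one row per measurement.
    A = rng.standard_normal((m, n1 * n2))
    sq_norms[t] = np.sum((A @ X.ravel()) ** 2)

# First statement: the mean of the squared norm is close to m.
mean_sq = sq_norms.mean()
# Second statement: deviations of relative size 0.25 are rare.
frac_outside = np.mean(np.abs(sq_norms - m) >= 0.25 * m)
```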

For the subsequent lemmas, $\mathcal{S}$ denotes the unit Frobenius norm sphere in $\mathbb{R}^{n_1 \times n_2}$.
Lemma C.2 Let $\mathcal{D} \subseteq \mathbb{R}^{n_1 \times n_2}$ be an arbitrary cone and $\mathcal{G}(\cdot) : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^m$ be an i.i.d. Gaussian measurement
operator. Assume that the set $\bar{\mathcal{D}} = \mathcal{S} \cap \mathcal{D}$ has $\varepsilon$-covering number bounded above by $\eta(\varepsilon)$. Then, there
exist constants $c_1, c_2 > 0$ such that whenever $m \ge c_1 \log \eta(1/4)$, with probability $1 - 2\exp(-c_2 m)$, we have

$$\mathcal{D} \cap \mathcal{N}(\mathcal{G}) = \{0\}.$$

Proof. Let rj = ri{j), and {Xij^^j^ be a |;-covering of V. With probability at least 1 — 2riexp{~ce^m), for 
all i, we have 

(l-e)m< M(X,)||^ < (l + e)m. (C.2) 
Now, let Xsup ~ argsupxg-p ||w4(X)||2. Choose 1 < a < rj such that ||Xa — Xsup||2 < 1/4. Then: 

P(X,,p)||2 < ||-A(X„)||2 + P(X,,p - X,)||2 < (1 + e)m + ip(X,,p)||2. (C.3) 

Hence, ||y^(Xsup)||2 < |(l + £)m. Similarly, let Xjnf = arginfxg-p ||.4(X)||2. Choose 1 < b < ij satisfying 
llX,-Xi„f|i < 1/4. Then, 

M(Xi„f)||2 > ||-A(Xb)l|2 - P(Xi„f - Xb)||2 > (1 - e)m - 1(1 + e)m. (C.4) 

This yields j|^(Xinf)||2 > ^~^^ m. Choosing e = 1/4 whenever m > 22 Jog (77) with the desired probability, 
||^(Xinf)||2 > 0. Equivalently, V n N{A) = 0. Since A{-) is linear and I? is a cone, the claim is proved. ■ 

The following lemma gives a covering number of the set of low rank matrices. 

Lemma C.3 (Candes and Plan, [10]) Let $\mathcal{M}$ be the set of matrices in $\mathbb{R}^{n_1 \times n_2}$ with rank at most $r$. Then,
for any $\varepsilon > 0$, there exists a covering of $\mathcal{S} \cap \mathcal{M}$ with size at most $\big(\frac{c_3}{\varepsilon}\big)^{(n_1 + n_2) r}$, where
$c_3$ is an absolute constant.

In particular, $\log(\eta(1/4))$ is upper bounded by $C (n_1 + n_2) r$ for
some constant $C > 0$.

Now, we use Lemma C.3 to find the covering number of the set of simultaneously low rank and sparse 
matrices. 

C.l Proof of Lemma 5.2 

Proof. Assume $S$ has $\frac{1}{4}$-covering number $N$. Then, using Lemma C.2, whenever $m \ge c_1 \log N$, (5.1) will
hold. What remains is to find $N$. To do this, we cover each individual $d_1 \times d_2$ submatrix and then take the
union of the covers. For a fixed submatrix, using Lemma C.3, 4--covering number is given 
total there are (^^) x (^^) distinct submatrices. Consequently, by using log (^) « dlog j + d, we find 

log TV < log (^'^1^ X (^^^^ C^d^+d,)s^ < ^ + rf^ + ^2 log ^ + d2 + (rfi + d2)s log C, 

and obtain the desired result. ■ 